
Testing RAG knowledge base content

Mar 30, 2025

You can measure how well your knowledge base content will work even before your RAG solution is built!

What is RAG?

Retrieval-augmented generation (RAG) is a technique for using a large language model (LLM) to generate reliable, accurate output by grounding that output in content from a knowledge base.

For example, to use RAG to answer a question about a product, you would perform the following steps:

  1. Search the online product documentation for an article that contains information to answer the question
  2. Extract the HTML contents of that article and convert it to plain text
  3. Submit the following prompt to an LLM:
Article:
------
<article-text-here>
------

Answer the following question using only information from the article.
If there is no good answer in the article, say "I don't know".

Question: <user-question-here>
Answer:
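
In code, those three steps reduce to a little retrieval plus a formatted prompt. Here is a minimal Python sketch of that flow, assuming hypothetical search_product_docs() and ask_llm() placeholders for your own search component and model API (neither comes from the article or any particular library), with BeautifulSoup as one common choice for the HTML-to-text step:

# Minimal sketch of the three RAG steps above.
# search_product_docs() and ask_llm() are hypothetical placeholders,
# not functions from the article or from any particular library.
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package for HTML-to-text

PROMPT_TEMPLATE = """Article:
------
{article_text}
------

Answer the following question using only information from the article.
If there is no good answer in the article, say "I don't know".

Question: {question}
Answer:"""


def search_product_docs(question: str) -> str:
    """Placeholder: return the HTML of the best-matching documentation article."""
    raise NotImplementedError("Call your search component here")


def ask_llm(prompt: str) -> str:
    """Placeholder: send the prompt to whatever LLM API your project uses."""
    raise NotImplementedError("Call your model API here")


def answer_question(question: str) -> str:
    article_html = search_product_docs(question)                           # step 1: search
    article_text = BeautifulSoup(article_html, "html.parser").get_text()   # step 2: HTML to plain text
    prompt = PROMPT_TEMPLATE.format(article_text=article_text, question=question)
    return ask_llm(prompt)                                                  # step 3: prompt the LLM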

What makes RAG work?

If you prompt an LLM to generate output about a niche subject or about the very latest information, the model can't generate correct output on its own (i.e., based only on the vocabulary and weights from its pre-training). That's because the data used to pre-train LLMs would have contained little or no related text. That's why the RAG pattern is so useful: when you include accurate, up-to-date, domain-specific information as context in the prompt, the LLM can use that information to generate correct output. See: In-context learning (Wikipedia)

Example 1

Have you heard of the carbonWrite 9000 pencil? Probably not, because it’s only just been (pretend) invented! No pre-training data could possibly mention it. So, if you prompt an LLM about carbonWrite 9000, the model will not generate correct output:

Prompting an LLM with a generic question-answering prompt (Prompt Lab — IBM watsonx.ai). Generated output is highlighted in blue.

Example 2

However, if you include a snippet from the carbonWrite 9000 documentation in your prompt, you will get accurate output:

Prompting an LLM to generate an answer grounded in given article text (Prompt Lab — IBM watsonx.ai). Generated output is highlighted in blue.

Accurate, up-to-date, domain-specific content you pull into the prompt is the critical ingredient that makes RAG work.

When does RAG fail?

If you ask AI developers, architects, and researchers, they can tell you many potential causes of poor-quality or incorrect output being returned by RAG solutions, including: wrong chunk size, weak embedding model, vector database failure, search algorithm miss, poor re-ranking model, sloppy prompt engineering, sub-optimal language model architecture, language model too small, …

But they probably won't mention problems with the content in the knowledge base itself. (They don't think of it because there usually aren't writers on those teams, and because people tend to focus on their own area of expertise.) In contrast, every time I've worked on a RAG solution with content professionals, their first question is: “What should we be doing in the knowledge base to set this RAG solution up for success?”

The truth is, knowledge base content can make or break a RAG solution! When your content is optimized for this use, you can get great results with simpler, less-expensive solutions and smaller models. But if your content is lousy, the greatest large language model in the world can’t save you.

Example 3

If the same carbonWrite 9000 article from Example 2 is written with less detail, the same prompt doesn’t yield a precise answer:

Prompting an LLM to generate an answer grounded in weak article text (Prompt Lab — IBM watsonx.ai). Generated output is highlighted in blue.

If the information you pull into a RAG prompt is missing details, is out of date, or is written poorly, then pulling that information in won’t help the model generate correct output.

How to test your knowledge base

Teams test that their embedding model is working properly. They test that their documents are being ingested into their search component accurately. They confirm that their prompt follows prompt engineering best practices. They use benchmarks to test that the large language model performs well. And they submit questions to the running RAG solution to test that the application is working well, end to end.

But nobody remembers to test the knowledge base content itself!

One way to verify that your knowledge base content is doing its job is to submit sample input to the running RAG solution and then evaluate the accuracy and quality of the generated output. This is a reasonable process, and teams run these end-to-end tests already. But there are problems with this method:

  • You have to wait for the RAG solution to be running, which delays your ability to work on shortcomings in the knowledge base
  • There might be multiple RAG solutions using your content as a knowledge base — will you test them all?
  • Manual evaluation is time-consuming
  • Automated evaluation requires preparing expected output and isn’t always reliable
  • Failures might be caused by components other than the knowledge base

Matching generated questions with expected questions

For question-answering RAG solutions, there is another way to test your knowledge base content, one that approaches the problem in reverse:

  1. Collect real user questions (See: Question-driven content design)
  2. Prompt an LLM to generate questions answered by the knowledge base
  3. Compare the generated questions with the expected questions

Sample prompt

Here's a sample prompt to generate questions answered by the text “The quick, brown fox jumps over the lazy dog.”

Prompting an LLM to generate questions answered by the given article text (Prompt Lab — IBM watsonx.ai). Generated output is highlighted in blue.
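
The exact wording lives in the screenshot above, but a question-generating prompt along those lines might look something like this (illustrative phrasing, not necessarily the prompt shown):

Article:
------
The quick, brown fox jumps over the lazy dog.
------

List the questions that can be answered using only information from the article.

Questions: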

Sample results

Comparing LLM-generated questions with expected user questions

Interpreting results

  • When generated questions match expected questions, that’s a sign a RAG solution grounded in your content can successfully answer those expected questions.
  • When generated questions don’t match expected questions, that’s a sign there is a problem with the content.

Content problems might include:

  • Missing information — a gap in subject matter, or missing details
  • Mismatch — between the terminology or wording of the expected questions and the content (e.g., acronyms, synonyms, terminology from other domains, or the mental model underlying the user questions)
  • Structure or writing style — the LLM can’t identify important facts

This method is indirect. But the testing can be fully automated and will scale up for many questions and large knowledge bases.
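
As a rough sketch of that automation in Python: generate_questions() below is a placeholder for prompting your LLM with a prompt like the one shown earlier, and sentence-transformers embeddings with a cosine-similarity threshold are just one possible way to decide whether a generated question counts as matching an expected one (the 0.75 threshold is an arbitrary starting point, not something prescribed by the article):

# Sketch: flag expected user questions that an article's generated questions don't cover.
from sentence_transformers import SentenceTransformer, util  # assumes the sentence-transformers package


def generate_questions(article_text: str) -> list[str]:
    """Placeholder: prompt your LLM to list the questions this article answers."""
    raise NotImplementedError("Call your model API here")


def uncovered_questions(expected: list[str], article_text: str, threshold: float = 0.75) -> list[str]:
    """Return the expected questions with no sufficiently similar generated question."""
    generated = generate_questions(article_text)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Rows are expected questions, columns are generated questions.
    similarity = util.cos_sim(model.encode(expected), model.encode(generated))
    return [question for question, row in zip(expected, similarity) if float(row.max()) < threshold]

Any expected questions this check flags point at content to revisit before blaming the rest of the RAG pipeline.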

Here’s a Python notebook demonstrating the complete technique:

Example tool and workflow

The following video demonstrates a sample Flask web app to help with applying this technique.

A sample Flask web app to help with testing whether a RAG solution will be able to answer expected questions using given knowledge base content.
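
The app itself is only shown in the video; as a rough idea of its shape (a hypothetical sketch, not the author's code), a single Flask endpoint could accept article text plus expected questions and report which questions go unmatched:

# Hypothetical sketch of a small content-testing endpoint (not the app from the video).
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/check", methods=["POST"])
def check():
    # Expects JSON like: {"article_text": "...", "expected_questions": ["...", "..."]}
    data = request.get_json()
    # uncovered_questions() is the comparison sketch from the previous section.
    unmatched = uncovered_questions(data["expected_questions"], data["article_text"])
    return jsonify({"unmatched_expected_questions": unmatched})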

Conclusion

Everyone is excited about RAG and AI agents right now! But people often overlook the crucial role that knowledge base content plays in the success of those solutions. Testing that your content can be used to answer expected questions sets those solutions up for success.

Test your content to make RAG solutions using that content as a knowledge base more successful. (Scroll image: https://commons.wikimedia.org/wiki/File:Paper_Scroll_2.svg and feather pen image: https://commons.wikimedia.org/wiki/File:Quill_pen_transparency.png)

Written by Sarah Packowski

Design, build AI solutions by day. Experiment with input devices, drones, IoT, smart farming by night.