Information typing is the professional writer’s secret ingredient for RAG success
Content strategy is needed more than ever in the era of RAG.
What is RAG?
Retrieval-augmented generation (RAG) is a technique for using a large language model (LLM) to generate reliable, accurate output by grounding that output in content from a knowledge base.
For example, to use RAG to answer a question about a software product, you would perform the following steps:
- Search the online product documentation for an article that contains information to answer the question
- Extract the HTML content of that article and convert it to plain text
- Submit the following prompt to an LLM:
Article:
------
<article-text-here>
------
Answer the following question using only information from the article.
If there is no good answer in the article, say "I don't know".
Question: <user-question-here>
Answer:
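As a rough illustration, the prompt above could be assembled in Python like this. It is a minimal sketch: the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the call that sends the prompt to an LLM is left out because it depends on the provider you use.

```python
# Hypothetical helper names; only the template wording comes from the article.
PROMPT_TEMPLATE = """Article:
------
{article_text}
------
Answer the following question using only information from the article.
If there is no good answer in the article, say "I don't know".
Question: {user_question}
Answer:"""


def build_prompt(article_text: str, user_question: str) -> str:
    """Fill the template with one retrieved article and one user question."""
    return PROMPT_TEMPLATE.format(
        article_text=article_text.strip(),
        user_question=user_question.strip(),
    )


prompt = build_prompt(
    "The carbonWrite 9000 is a pencil.",
    "What is carbonWrite 9000?",
)
print(prompt)
```

Keeping the template in one constant makes it easy to see that the only things that vary between requests are the retrieved article and the question.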
Search is critical for RAG success
The first step in that RAG pattern, search, is critical. If you cannot find the relevant information in your knowledge base, your LLM won’t be able to generate a good result.
Many strategies have been proposed for implementing effective search. Some of those strategies are pretty elaborate. But a simple technique that can really boost results is to use information typing.
What is information typing?
Topic-based writing involves breaking content up into small, complete, self-contained articles (called topics) that are each about only one subject. Information typing involves defining topic types according to the nature of their content or their purpose.
Here are three examples of topic types:
- Explain a concept
- Describe the steps to perform a task
- List detailed reference information
(You could decide to create other types too. Whatever makes sense for your content.)
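One way to picture information typing in a knowledge base is as a type label stored alongside each topic. The sketch below assumes the three topic types listed above. The class names, the `topics_of_type` helper, and the "widget" topics are all hypothetical placeholders, not part of any real product or library.

```python
from dataclasses import dataclass
from enum import Enum


class TopicType(Enum):
    CONCEPT = "concept"      # explains a concept
    TASK = "task"            # describes the steps to perform a task
    REFERENCE = "reference"  # lists detailed reference information


@dataclass(frozen=True)
class Topic:
    title: str
    topic_type: TopicType
    body: str


def topics_of_type(topics: list[Topic], topic_type: TopicType) -> list[Topic]:
    """Return only the topics that carry the requested type label."""
    return [t for t in topics if t.topic_type is topic_type]


# Placeholder topics, invented purely for illustration.
library = [
    Topic("What is a widget?", TopicType.CONCEPT, "A widget is ..."),
    Topic("Installing a widget", TopicType.TASK, "1. Unpack the widget ..."),
    Topic("widget_config command", TopicType.REFERENCE, "Syntax: ..."),
]

print([t.title for t in topics_of_type(library, TopicType.TASK)])
```

If you define other topic types for your content, they would simply become extra members of the enum.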
Example 1: No information typing
Imagine you have invented a new hand-held writing implement, called the carbonWrite 9000. And imagine you have the following documentation to support people using the carbonWrite 9000. This documentation is one long article with no information typing:
# carbonWrite 9000
Congratulations on purchasing the carbonWrite 9000!
## Introduction
Once you have sharpened the end, you can use the pencil to write
and draw on a variety of surfaces.
## Features
The carbonWrite 9000 has many state-of-the-art features for
writing with different line widths, writing in the dark, and
erasing what you've written.
### Variable line widths
When the tip is dull, lines will be thick. When the end is sharp,
lines will be thin.
### Built-in lighting
The carbonWrite 9000 has an on-board light for writing in the dark.
### State-of-the-art rubOut(TM) feature
If you purchased the optional rubOut eraser feature, you can erase
previous pencil output!
### Voice interface
You can submit administrative requests to the carbonWrite 9000
using voice commands.
## Administration
The carbonWrite 9000 battery has two modes:
- High performance, for faster response times and brighter light
- Long life, to extend the battery life as long as possible
### Command syntax
battery_config [ performance | longevity ]
Now, let’s imagine a RAG solution is deployed that uses the above content as the knowledge base. Here is a Python notebook with the RAG implementation:
Let’s say users submit the following questions to the RAG solution:
- What is carbonWrite 9000?
- What features does the carbonWrite 9000 have?
- Can I write on cardboard?
- How can I erase what I wrote?
- How can I make my battery last longer?
- I’m having trouble writing because the end is dull. What can I do?
Here are answers from our sample RAG implementation:
Not great, right?
Example 2: With information typing
Here’s what the carbonWrite documentation might look like when written as concept, task, and reference topics:
Topic 1 (concept)
# carbonWrite 9000
The carbonWrite 9000 is a pencil.
## Features
The carbonWrite 9000 has many state-of-the-art features:
- The ability to produce different line widths
- An on-board light for writing in the dark, or in low light
- The rubOut feature for erasing what you've written
- A voice interface for administering your carbonWrite 9000
Topic 2 (concept)
# Writing surfaces
You can write and draw on a variety of surfaces with the carbonWrite 9000.
Supported writing surfaces include:
- Paper
- Cardboard
- Wood
Topic 3 (task)
# Sharpening your carbonWrite 9000
If you need more of the carbon core to be sticking out at the writing
end of your carbonWrite 9000, or if you want to write or draw thinner
lines, you can sharpen your carbonWrite 9000.
## Lengthening the carbon writing tip
If the carbon sticking out at the writing end of your carbonWrite
9000 gets too short, you can expose more of the carbon by unwinding
the material that surrounds the inner core:
1. Grasp the tail of the white string near the carbon tip.
2. Gently pull on the string, unwinding the material around the
circumference of the pencil.
3. Once the desired amount of carbon is exposed, use scissors to
cut the trailing string and any material attached to it.
## Shaping the carbon writing tip to a narrower point
If the tip of the carbon is dull or if the lines the pencil makes
are too thick, then you can sharpen the tip in one of two ways:
- Rub the sides of the carbon tip on any rough surface to sharpen it to a
narrower point.
- Use a sharp knife to whittle the carbon writing tip to a narrower point.
Topic 4 (task)
# Writing in the dark or in low light
The carbonWrite 9000 has an on-board light for writing in the dark.
You do not need to take any manual steps to use the light. When ambient
lighting gets below the hard-coded threshold, the lightbulb illuminates
automatically.
Topic 5 (task)
# Erasing what you wrote or drew
If you purchased the optional rubOut eraser feature, you can erase
previous pencil output:
1. Lift the pencil from the page
2. Invert the pencil and place the rubOut eraser on the page
3. Rub the eraser over the lines you want to erase until the
lines are gone
Topic 6 (task)
# Managing battery life
The carbonWrite 9000 battery has two modes:
- High performance, for brighter light
- Long life, to extend the battery life as long as possible
## Option 1: Voice interface
You can set the battery mode by speaking your requested mode
to the carbonWrite 9000 voice interface.
## Option 2: System command line
You can also set the battery mode by calling the battery_config
command from a system command line.
- To get the most life from your battery, call "battery_config
longevity"
- To get the brightest light, at the expense of shorter battery
life, call "battery_config performance"
Topic 7 (reference)
# battery_config command
To configure the carbonWrite 9000 battery mode, call
battery_config from a system command line.
## Syntax
battery_config [ performance | longevity ]
### Example 1: Configuring for performance
battery_config performance
### Example 2: Configuring for long battery life
battery_config longevity
When our RAG solution is updated to use these seven topics as the knowledge base, to filter search by the topic type that suits each question, and to pull complete topics into LLM prompts, it answers all six questions well:
Sample notebook
Here again is a link to a Python notebook demonstrating both examples:
Notes about these examples
These examples are pretty contrived, because they need to be simple enough to see at a glance while still demonstrating the key advantages of information typing. Keep the following points in mind when reflecting on these examples and experimenting with the sample notebook:
- Knowledge base size: The carbonWrite 9000 documentation is very small. You could just pull all of it into your prompt! But for real RAG solutions, the knowledge base might be thousands of pages long. You couldn’t paste that much content into a prompt, because there is a limit to the size of the input (the context window) and because LLMs tend to “get lost in the middle” of long prompts. Despite its small size, though, this sample demonstrates patterns common in real-world knowledge base content.
- Classifying user input, filtering search by content type: A key advantage of constructing your knowledge base content using topic-based writing and information typing is that certain types of topics are better suited to answer certain types of questions. For example, a task topic is better suited to answer how-to questions than a concept topic.
- Parent document retrieval: A common search strategy for RAG solutions is to “chunk” content (based on word count or heading levels, for example) and then pull one or more relevant chunks into the LLM prompt. In the later part of the sample notebook, however, the whole topic is pulled into the prompt. This strategy is sometimes called parent document retrieval. If a content professional has made the effort to put all relevant information into a complete, self-contained topic, it makes sense to take advantage of that information architecture by pulling the complete topic into your prompt instead of just some chunks of it.
- Writing style: A big difference between the “before” and “after” sample documentation is that the “after” content was broken up into individual topics. But the writing style is also different. The “before” content is focused on the product itself, whereas the “after” content is focused on user tasks. This user task-focused style of writing is both required for and a natural consequence of using information typing. And it just so happens that this style of writing is better for human readers and for RAG solutions.
- Question-driven content development: Another difference is that the language (word choice) in the “after” content better matches the language in the user questions. The “after” version of the content also includes more examples. Instead of just saying you can write or draw on “a variety of surfaces”, the updated content lists specific surfaces, including paper and cardboard. If you know what questions your users are asking, you can set your RAG solution up for success by using language in your knowledge base that matches users’ language and by including common examples.
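To make the classify, filter, and whole-topic retrieval ideas from these notes concrete, here is a small Python sketch. It is not the sample notebook's implementation. It assumes a crude keyword rule in place of a real question classifier and a simple word-overlap score in place of a real search engine, and the topic bodies are shortened paraphrases of the "after" topics. All function and class names are hypothetical.

```python
import re
from typing import NamedTuple


class Topic(NamedTuple):
    title: str
    topic_type: str  # "concept", "task", or "reference"
    body: str


# Shortened paraphrases of six of the "after" topics from this article.
TOPICS = [
    Topic("carbonWrite 9000", "concept", "The carbonWrite 9000 is a pencil."),
    Topic("Writing surfaces", "concept",
          "Supported writing surfaces include paper, cardboard, and wood."),
    Topic("Sharpening your carbonWrite 9000", "task",
          "If the tip of the carbon is dull, rub its sides on any rough "
          "surface to sharpen it to a narrower point."),
    Topic("Erasing what you wrote or drew", "task",
          "Invert the pencil and rub the rubOut eraser over the lines "
          "you want to erase."),
    Topic("Managing battery life", "task",
          'To get the most life from your battery, call '
          '"battery_config longevity".'),
    Topic("battery_config command", "reference",
          "Syntax: battery_config [ performance | longevity ]"),
]


def classify_question(question: str) -> str:
    """Assumption: crude keyword rules stand in for a real classifier."""
    q = question.lower()
    if q.startswith("how ") or "what can i do" in q:
        return "task"
    if "syntax" in q or "command" in q:
        return "reference"
    return "concept"


def words(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve_topic(question: str) -> Topic:
    """Filter topics by the question's type, then return the WHOLE best topic."""
    wanted = classify_question(question)
    candidates = [t for t in TOPICS if t.topic_type == wanted]
    q_words = words(question)
    # Score each candidate by how many words it shares with the question.
    return max(
        candidates,
        key=lambda t: len(q_words & words(t.title + " " + t.body)),
    )


for question in [
    "Can I write on cardboard?",
    "How can I erase what I wrote?",
    "How can I make my battery last longer?",
]:
    topic = retrieve_topic(question)
    print(f"{question} -> [{topic.topic_type}] {topic.title}")
```

Because `retrieve_topic` returns the full `Topic` rather than a fragment of it, the complete, self-contained topic body is what would go into the prompt. In a real system, the keyword rules and the overlap score would each be replaced by something more robust.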
Conclusion
Information typing has been around for decades. For example, the Darwin Information Typing Architecture (DITA) specification was introduced in 2001. But using knowledge base content strategy to improve RAG results has been overlooked in the literature, because AI research and development teams rarely include content professionals.
Using topic-based writing in the knowledge base, classifying user input and then filtering search for appropriate content types, and parent document retrieval all work together to improve RAG results — even with simple prompts and basic LLMs.