Is your data RAG ready?

Sarah Packowski

To get value from productivity-enhancing or customer-support RAG solutions, you’ll need to update your data management practices.

A new kind of data, powering solutions in a new way

Throughout the AI boom of the past ~20 years, people have often repeated the saying “data is the new oil” (attributed to British mathematician Clive Humby). The point of that statement is that information technology solutions rely on — could not function without — data that has been properly refined and processed.

For much of that time, “data” meant: a mountain of labeled, historical data used to train AI models.

More recently, for training large language models (LLMs), “data” has meant: terabytes of unstructured text, usually scraped from the internet.

Now, in the era of retrieval-augmented generation (RAG) and agentic LLM solutions, a new kind of “data” is becoming critical for success: strategic content that is pulled into prompts at run time.

Example

Imagine you have invented a new hand-held writing implement, called the carbonWrite 9000. Because you just invented it, there’s no way information about your new product could be in the pre-training data that was used to train any LLM. As a result, if you asked an LLM about the carbonWrite 9000, you would not get an accurate answer:

Prompting an LLM with a generic question-answering prompt (Prompt Lab — IBM watsonx.ai)

However, if you include a snippet from the carbonWrite 9000 documentation in your prompt, you’ll get accurate answers:

Prompting an LLM to generate an answer grounded in given article text (Prompt Lab — IBM watsonx.ai)

In this example, the product documentation is the strategic content that is pulled into the prompt at run time. Without that strategic content, an LLM question-answering solution could not succeed.
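
To make the mechanics concrete, here is a minimal sketch of what “pulling strategic content into a prompt at run time” looks like in code. The snippet text is abridged, and generate() is a placeholder for whatever LLM API you use (for example, a watsonx.ai text-generation call), not a real function:

```python
def build_grounded_prompt(article_text: str, question: str) -> str:
    """Assemble a question-answering prompt grounded in the given article text."""
    return (
        "Answer the question using only the article below.\n\n"
        f"Article:\n{article_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Abridged stand-in for a snippet of the carbonWrite 9000 documentation.
snippet = "The carbonWrite 9000 is a hand-held writing implement that ..."

prompt = build_grounded_prompt(snippet, "What is the carbonWrite 9000?")
# answer = generate(prompt)  # placeholder: call the LLM API of your choice here
```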

A new kind of data is becoming critical for success: strategic content that is pulled into LLM prompts at run time.

Data management — current state

Companies and institutions already have extensive processes and elaborate tools for data management. Management of: inventory, customer lists, sales history, patents filed, employee information, senior leaders’ blog posts, user activity on web pages, advertising content, patient medical information, product documentation, student academic records, taxes filed, customer surveys, music playlists, bus route schedules, and so on.

Teams who are responsible for managing data sources like the ones listed above have two responsibilities that are sometimes at odds with one another:

  • Keep the content secure and apply effective governance processes
  • Make the content available to derive (business) value from it

Two sides of the data management coin: security & governance, and business value

Security and governance are obviously incredibly important. The consequences of a data breach can be catastrophic. The risk to a company or an institution of not keeping their data secure is extreme.

However, the consequences of not implementing processes and tools to get value from data could also be existential: a company might be out-competed and go out of business; an institution might become so mired in inefficient processes and poor service to its users that it ceases to function.

Teams who manage these data sources need to do both: keep the data secure, and facilitate getting value from that data. Currently, because of the fear of legal fallout and because it’s a lot of work to make the data safely available for use, the focus is almost exclusively on security.

Often, data management is focused exclusively on security.

RAG readiness

For RAG and agentic LLM solutions to be able to use strategic content in their prompts, those solutions need to be able to quickly search for and extract relevant data at run time. RAG readiness refers to how well-suited a data source is to being used by RAG solutions this way.

Example

Imagine your online seed company has a website where customers can read about different seeds for sale and read articles about gardening. Your website has a search feature, and you save the search queries as they are submitted, with a timestamp but with no other metadata. Every month, your sales and content teams review historical search queries to understand what seeds customers are looking for the most and what are the most common questions. The process for reviewing those historical search queries is this: someone emails the administrator of the database where the historical searches are stored, and then the administrator extracts the data from the database into a spreadsheet, with all the search queries ordered by timestamp. Finally, your team uses Microsoft Excel to analyze the queries.

(They’re no Michael Jarman. But still, they have some good Excel skills.)

Anyway, WWE-style trophy belts aside…

Now, your team wants to improve the website search experience by displaying similar searches or anticipating searches based on patterns in past searches. Without getting into the weeds of how to build this, suffice it to say that the search feature on your website needs run-time access to historical search queries.

Here are just a few of the challenges of building the new search enhancements with the existing historical-search-queries data management process:

  • Because the assumption has been that historical search queries would be collected for analysis in bulk, once per month, the database is optimized as a data warehouse, not for rapid-fire transactions. This means your new search feature might not get fast results querying that database directly.
  • Because the process is to ask an admin to extract the historical search queries, database access has been simple: only two admins have any access. Also, because the assumption was that an admin would be extracting the data, the schemas and tables sometimes contain both business-confidential information and historical search queries. This means setting up read-only access to just the historical search queries for your new search feature would be difficult (e.g., it would require restructuring the database).
  • In an effort to avoid the unnecessary collection of personal information, only query strings and timestamps are being saved. In other words, there’s nothing like a session ID being saved. So there’s no way to tell when multiple searches are submitted during the same visit to the website; and there’s no way to see patterns like “searches for X are usually followed by searches for Y”. This means you could not build some of the pattern-matching functionality in your new search feature without refactoring the historical search query collecting functionality and adding a new column in the database for a session ID.
  • Speaking of avoiding personal information, when the admins extract historical search queries, they do a quick manual sweep to remove any personal information from the extracted data. (You’d be surprised what people type into fields on websites. Some visitor will definitely type in the search bar: “Can you tell me when my order will be shipped? My customer ID is name@domain.email and my phone number is 123 - 45678”.) This means your new search functionality cannot be given access to read the raw historical search queries from the database in its current form, because there is a risk of exposing previous website visitors’ personal information. Automating that manual sweep would be one of the required updates, as sketched after this list.
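
As a sketch of that last point: the manual sweep could be automated by scrubbing obvious personal information from queries before they are stored or exposed. This naive version only catches patterns like the example above; real PII detection needs far more than two regular expressions:

```python
import re

# Naive scrubbing of obvious personal information from search queries
# before they are stored or handed to a RAG solution. This only catches
# simple email and phone-number patterns like the example above.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def scrub_query(query: str) -> str:
    query = EMAIL_RE.sub("[EMAIL]", query)
    return PHONE_RE.sub("[PHONE]", query)

print(scrub_query(
    "Can you tell me when my order will be shipped? "
    "My customer ID is name@domain.email and my phone number is 123 - 45678"
))
# -> "... My customer ID is [EMAIL] and my phone number is [PHONE]"
```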

These challenges are not insurmountable. But this fictitious example demonstrates that a process originally designed for infrequent, bulk analysis of historical data will need to be updated to make that data safely, reliably available at run time for RAG and agentic LLM solutions. You’ll need to look at how the data is collected and stored, what details are captured, how access is granted, and how the data is cleaned and anonymized, and you’ll need to add new monitoring of how LLM solutions are using the data.

A process designed for infrequent, bulk analysis of historical data will need to be updated to make that data available for LLM solutions at run time.

RAG readiness evaluation criteria

To measure and systematically improve the RAG readiness of data sources, we need evaluation criteria. Consider the following 5 levels of data source RAG readiness, differentiated by data source attributes. (The levels are cumulative. Data sources at level N+1 have all the advantages of level N and more.)

Level 1: No RAG value

A business or institution cannot get value from a level 1 data source through its use in RAG or agentic LLM solutions.

Attributes

  • Most employees cannot get authorization to access the data
  • The process for requesting authorization is onerous
  • Using the data requires asking someone else to provide it to you
  • The organization of the data is difficult to navigate and not documented
  • The data cannot be downloaded in bulk or accessed programmatically
  • The data has not been cleaned of sensitive information
  • The format of the data requires extensive processing (e.g. .mp4 files, PDFs, unwieldy HTML files)

Even setting up demonstrations or proof-of-concept (POC) RAG solutions grounded in a level 1 data source is not feasible.

Level 2: Some RAG value

A business or institution will not get long-term benefit from RAG solutions grounded in level 2 data, because the manual work of keeping those solutions current is not sustainable.

Attributes

  • Most employees can get authorization to access the data
  • There is a basic method for downloading/copying the data (e.g. a web page where data can be downloaded in bulk with the click of a button)
  • The method for downloading/copying data is documented

RAG solution builders could manually download content from this source and then index it in their own search solution to build RAG POCs.
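
Here is a minimal sketch of that do-it-yourself search step, with TF-IDF (via scikit-learn) standing in for whatever retriever you actually build, and a hypothetical downloaded_docs folder holding the manually downloaded text files:

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Index manually downloaded text files (the downloaded_docs folder is hypothetical).
docs = [path.read_text() for path in Path("downloaded_docs").glob("*.txt")]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

def search(query: str, k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    scores = linear_kernel(vectorizer.transform([query]), doc_vectors).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]
```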

Level 3: Medium RAG value

A business or institution can benefit from RAG solutions grounded in level 3 data sources. But the cost of maintaining those solutions will be high enough that many of them will be abandoned.

Attributes

  • There is a straightforward (takes one day or less) and transparent (documented) process for getting authorization to access the data
  • The data does not contain sensitive information (e.g. confidential and personal information has been removed or anonymized)
  • There’s a method for programmatically accessing the data (e.g. API)
  • There is a straightforward and transparent process for getting help with authorization or data access problems

RAG solution builders still need to index data from a level 3 data source in their own search component, but at least that can be automated. This makes it faster and easier to build RAG POCs, and makes it possible to move some solutions into production.
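
The difference at level 3 is that the download-and-index loop can be automated. A sketch, assuming a hypothetical documents API and using a simple in-memory dictionary as a stand-in for a real search index:

```python
import requests

API_URL = "https://example.com/api/v1/documents"  # hypothetical endpoint
index: dict[str, str] = {}  # stand-in for a real search index

def refresh_index(token: str) -> None:
    """Pull the latest documents from the data source and (re)index them."""
    response = requests.get(
        API_URL, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    for record in response.json():
        index[record["id"]] = record["text"]

# Run refresh_index() on a schedule (e.g. a nightly cron job)
# so the index tracks the source.
```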

Level 4: High RAG value

The business will get high value from a level 4 data source because it’s easily incorporated into RAG solutions that increase productivity and reduce support costs.

Attributes

  • The process for getting authorization is automated and immediate
  • There is a search API that RAG solution builders can use, and the latest data is automatically, continuously indexed
  • There is a process for opening defects when data is missing, search fails, or data needs to be updated

With a level 4 data source, RAG solution builders can stand up demonstrations and POC RAG solutions in an afternoon, and move on to production RAG solutions in a matter of days. The search API makes it straightforward to add this data source to a RAG solution without having to build a search component.
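
At level 4, the RAG solution skips building and refreshing its own index entirely and calls the source’s search API at run time. A sketch, with a hypothetical /search endpoint and response shape:

```python
import requests

SEARCH_URL = "https://example.com/api/v1/search"  # hypothetical endpoint

def retrieve_context(question: str, k: int = 3) -> str:
    """Fetch the top-k passages for a question from the source's search API."""
    response = requests.get(
        SEARCH_URL, params={"q": question, "limit": k}, timeout=10
    )
    response.raise_for_status()
    return "\n\n".join(hit["text"] for hit in response.json()["results"])

# The returned passages can be dropped straight into a grounded prompt,
# as in the earlier build_grounded_prompt() sketch.
```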

Level 5: Winning at RAG!

A business or institution will gain an industry advantage from a level 5 data source, as teams create and maintain individual RAG solutions that can also work together to answer questions and carry out users’ requested actions. A level 5 data source might also be a direct source of revenue.

Attributes

  • Data owners prioritize RAG readiness and are proactive about getting input from RAG solution builders
  • Search and data APIs are available that align with emerging best practices — such as a data API that returns prompt-ready plain text preserving the meaning of elements like images and tables (one possible rendering is sketched after this list)
  • There is a straightforward, speedy, and transparent process for requesting fixes and enhancements to the search and data APIs
  • There is a place where a community of users can share their projects that use this data and lessons learned
  • Ownership, lineage, and update history are programmatically available for assets (e.g. videos, articles, documents)
  • The data is available for external users to access (Customers of company X might build employee-training RAG solutions for their own processes that use company X software. Or residents of a city might build RAG solutions on announcements from city hall.)
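
As an illustration of “prompt-ready plain text that retains meaning”: a level 5 data API might flatten tables into labeled rows (and replace images with their text descriptions) so the information survives inside a prompt. The carbonWrite spec table below is made up purely for the example:

```python
def table_to_text(headers: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into "column: value" lines, one block per row."""
    return "\n\n".join(
        "\n".join(f"{header}: {value}" for header, value in zip(headers, row))
        for row in rows
    )

# Made-up carbonWrite specs, purely for illustration.
print(table_to_text(
    ["Model", "Tip width"],
    [["carbonWrite 9000", "0.5 mm"], ["carbonWrite 9000 Pro", "0.3 mm"]],
))
```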

With level 5 content sources, RAG solution builders — internal to the organization and external users — can effectively troubleshoot and fix problems, easily collaborate on solutions, and build the infrastructure needed to orchestrate individual RAG solutions, so they can contribute to shared conversations and route conversations between solutions.

To measure and systematically improve the RAG readiness of data sources, we need evaluation criteria.

Conclusion

Data security is paramount. But to get the benefits of productivity-enhancing and customer-support RAG and agentic LLM solutions, we must also do the work required to make data sources safely and reliably available for those solutions to pull strategic content into LLM prompts at run time.

An unbalanced approach is no longer acceptable.

“Le Chat déambule”, 20 sculptures by Philippe Geluck, Wilson Quay, Geneva (source)
