How publishers need to adapt in the era of RAG

Aug 31, 2024

Instead of resenting that people are scraping your content, embrace them as customers.

The goose that laid the golden eggs. Source: Wikimedia Commons

What if RAG solutions are actually a gift to publishers?

For starters

First, let’s clarify some terms: what is RAG, and who are “publishers”? Let’s also address the intellectual property elephant in the room.

What is RAG?

Retrieval-augmented generation (RAG) is a technique for using a large language model to generate reliable, accurate output by grounding that output in content from a knowledge base.

For example, to use RAG to answer a question about a software product, you would perform the following steps:

Search the online product documentation for an article that contains information to answer the question
Extract the HTML content of that article and convert it to plain text
Submit the following prompt to a large language model:

Article:
------
<article-text-here>
------

Answer the following question using only information from the article. 
If there is no good answer in the article, say "I don't know".

Question: <user-question-here>
Answer:
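As a rough illustration, steps 2 and 3 above could be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names (`html_to_text`, `build_prompt`, `answer_question`) are made up for this example, the search in step 1 is left out, and `llm` stands in for whatever function you use to call your large language model.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Step 2: convert an article's HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# The same prompt shown above, with placeholders for the two inputs.
PROMPT_TEMPLATE = (
    "Article:\n"
    "------\n"
    "{article}\n"
    "------\n"
    "\n"
    "Answer the following question using only information from the article.\n"
    "If there is no good answer in the article, say \"I don't know\".\n"
    "\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(article_html, question):
    """Step 3: fill the prompt template with the article text and question."""
    return PROMPT_TEMPLATE.format(article=html_to_text(article_html), question=question)


def answer_question(article_html, question, llm):
    """Send the prompt to `llm`, a caller-supplied function (hypothetical)."""
    return llm(build_prompt(article_html, question))
```

Keeping the model call behind a plain function argument means the same prompt-building code works with any model provider.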

Who are “publishers”?

In this blog post, “publishers” includes anyone who makes any type of content available to consumers. For example:

Anyone hosting HTML pages that are accessed with HTTP GET requests by web browsers
Anyone publishing physical or electronic books
Anyone producing education or training material for in-person or online courses
Anyone disseminating art (including visual art, animation, and music) that is available in physical or electronic form, including online
Anyone curating academic papers

In this blog post, content creators are the authors, artists, and musicians who write the stories, lyrics, scripts, and reference information, or who create the images, videos, and music. Creators can also be publishers, and many creators depend on publishing platforms (like Medium). In this blog post, “publishers” refers to creators too.

In this blog post, consumers are people who view, read, watch, listen to, link to, reuse, or otherwise engage with content.

Problematic history of generative AI

Publishers have a right to be angry about the way tech companies (and some researchers) have viewed any and all content they can scrape, scan, or download as fair game for training AI. That behavior is indefensible. Hopefully, as lawsuits make their way through the courts, publishers will be compensated for historical injustices and protected from future theft.

Because of that problematic history, publishers are — understandably — defensive about any use of their content for AI. But publishers need to understand there are many different AI use cases.

Publishers need to make a distinction between people greedily gobbling up content for training models and people selectively accessing content for RAG solutions.

Unlike scraping content for AI training, the use of content in RAG solutions presents an incredible, lucrative opportunity for publishers.

The evolution of content 1·2·3

Remember when having an easy-to-remember URL was incredibly important because there were no search engines? Needs shift.

1. The simple past

Consider the content landscape before ~2022.

The consumer perspective:

Expected publishers to make web content available for web browsers (and popular e-readers, music players, and streaming apps)
Expected free access to content, but accepted that some content would include ads or affiliate links, or be behind paywalls
Wanted relevant content to be easy to find
Saw value in the published asset (e.g., the article or video) and might save or share links to high-value assets
Manually viewed, read, or listened to content, usually consuming most or all of an article, video, or song from beginning to end

The publisher perspective:

Wanted consumers to stay on their site
Viewed features like search, audio interfaces, pleasing visual layout, and related links as a way to attract and retain consumers on their site

2. Current state

Here’s what’s changed, now that RAG is emerging as a killer application for large language models.

First, consider the intrepid RAG solution builders:

Unfortunately, still often view any and all content as fair game
Programmatically access the same endpoints set up for web browsers
Must scrape hard-to-access content and convert it to text
Repeatedly crawl and scrape entire sites to keep the search index behind their RAG solution, and therefore their knowledge base, current
Struggle to effectively use long-form content (originally intended to be consumed from beginning to end in its entirety) for short answers

Now consider the beleaguered publishers:

See their content absorbed into RAG solutions with no attribution
Have their servers bogged down by bursts of heavy traffic
Make no money from the RAG solutions using their content
Lose site visits, because RAG solutions already give consumers the information they need
Waste time trying to prevent their content from being scraped

Consumers are being set up to fail too:

Are quickly becoming used to asking natural language questions (instead of running keyword searches) and expecting natural language answers
Are often shown LLM-generated output without a paper trail of links to sources, making it no better than an ephemeral rumor

3. Potential future state

What if the conflict between RAG solution builders and publishers went away? What if publishers benefited from all those RAG solutions?

One answer is for publishers to provide paid search, content, and question-answering APIs.

Publishers win:

RAG solutions become a source of revenue for publishers
Can set beneficial terms of service for RAG solutions (e.g., requiring RAG solutions to link to source articles)
Can build features into their APIs that enable service improvements or additional monetization
Content that performs well will be used in more RAG solutions
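To make the idea concrete, here is a minimal in-memory sketch of what a paid publisher search API might look like. Everything in it is hypothetical: the `PublisherAPI` class, the `price_per_call` value, the naive keyword scoring, and the wording of the `terms` field are assumptions for illustration, not a description of any real service. It shows two of the points above: every call is metered so it can be billed, and every result carries its source URL along with the terms for using it.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    url: str
    title: str
    text: str


@dataclass
class PublisherAPI:
    """Hypothetical paid search API that a publisher might offer."""

    articles: list
    price_per_call: float = 0.01  # assumed price, for illustration only
    usage: dict = field(default_factory=dict)  # API key -> number of calls

    def search(self, api_key, query, limit=3):
        # Meter every call so the publisher can bill the RAG solution builder.
        self.usage[api_key] = self.usage.get(api_key, 0) + 1
        terms = query.lower().split()
        scored = []
        for article in self.articles:
            haystack = (article.title + " " + article.text).lower()
            # Naive relevance score: how often the query words appear.
            score = sum(haystack.count(term) for term in terms)
            if score > 0:
                scored.append((score, article))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Each result carries its source URL and the terms attached to its use.
        return [
            {
                "title": a.title,
                "text": a.text,
                "source_url": a.url,
                "terms": "answers must link to source_url",
            }
            for _, a in scored[:limit]
        ]

    def amount_owed(self, api_key):
        """Total owed by one API key at the assumed per-call price."""
        return self.usage.get(api_key, 0) * self.price_per_call
```

A RAG solution builder would call `search` with their API key, pass the returned `text` to the model, and show `source_url` next to the generated answer.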

RAG solution builders win too:

Can rely on getting relevant search results and up-to-date content without having to crawl, scrape, convert, and index content for search
Content will become optimized for RAG

Consumers win big time:

Entirely new content paradigms will emerge, bringing trustworthy, verifiable information from diverse sources right to you

There’s also a sustainability argument to be made:

RAG solution builders often use AI models and vector databases to embed (convert text to vectors) and store content for search. This search implementation is particularly energy intensive, so it is all the more wasteful for different RAG solution builders to store the same information in many duplicate vector databases.
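For readers unfamiliar with the embed-and-store step, here is a toy sketch of it. It is an illustration under loud assumptions: the `embed` function below just counts words, whereas real RAG solutions use an AI embedding model, and the `VectorStore` class is a made-up stand-in for a vector database. The point is only to show the shape of the work that each RAG solution builder currently repeats for the same content.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (real systems use a model)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Made-up stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, text):
        # Embedding happens once per document, at indexing time.
        self._items.append((doc_id, embed(text)))

    def search(self, query, top_k=1):
        # Embed the query, then rank stored documents by similarity to it.
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:top_k]]
```

If a publisher ran this indexing once and exposed search through an API, RAG solution builders would not each need their own copy of the store.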

The evolution from the simple past to a potential future

We are in the messy present, where people are trying to use content in new ways, but publishing infrastructure isn’t yet there to support it.

Conclusion

Publishers, your content is more valuable than ever. Instead of being angry that people are trying to use your content in their RAG solutions, make it easier for them to do that and charge them a fair price for it.

Kill not the goose that lays the golden eggs. Source: Wikimedia Commons