How publishers need to adapt in the era of RAG
Instead of resenting that people are scraping your content, embrace them as customers.
What if RAG solutions are actually a gift to publishers?
For starters
First, let’s clarify some terms: What is RAG, and who are “publishers”? Also, let’s address the intellectual property elephant in the room.
What is RAG?
Retrieval-augmented generation (RAG) is a technique for using a large language model to generate reliable, accurate output by grounding that output in content from a knowledge base.
For example, to use RAG to answer a question about a software product, you would perform the following steps:
- Search the online product documentation for an article that contains information to answer the question
- Extract the HTML content of that article and convert it to plain text
- Submit the following prompt to a large language model:
Article:
------
<article-text-here>
------
Answer the following question using only information from the article.
If there is no good answer in the article, say "I don't know".
Question: <user-question-here>
Answer:
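The second and third steps above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library, not a definitive implementation: the names html_to_text and build_prompt are made up for illustration, the search step is left out, and sending the prompt to a large language model is left to whichever client you use.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: convert an article's HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# The same prompt shown above, with placeholders for the two inputs.
PROMPT_TEMPLATE = (
    "Article:\n"
    "------\n"
    "{article}\n"
    "------\n"
    "Answer the following question using only information from the article.\n"
    'If there is no good answer in the article, say "I don\'t know".\n'
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(article_html, question):
    """Hypothetical helper: fill the prompt with article text and a question."""
    return PROMPT_TEMPLATE.format(
        article=html_to_text(article_html), question=question
    )


sample_html = (
    "<html><body><h1>Saving</h1>"
    "<p>Click <b>Save</b> to keep changes.</p>"
    "<script>track()</script></body></html>"
)
print(build_prompt(sample_html, "How do I keep my changes?"))
```

Joining text nodes with spaces is crude (it can leave a space before punctuation that follows a tag), which hints at why converting web pages to clean text is a chore for RAG solution builders.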
Who are “publishers”?
In this blog post, “publishers” includes anyone who makes any type of content available to consumers. For example:
- Anyone hosting HTML pages that are accessed with HTTP GET requests by web browsers
- Anyone publishing physical or electronic books
- Anyone producing education or training material for in-person or online courses
- Anyone disseminating art (including visual art, animation, and music) that is available in physical or electronic form, including online
- Anyone curating academic papers
In this blog post, content creators are the authors, artists, and musicians who write the stories, lyrics, scripts, and reference information, or who create the images, videos, and music. Creators can also be publishers. Many creators are dependent on publishing platforms (like Medium). In this blog post, “publishers” refers to creators too.
In this blog post, consumers are people who view, read, watch, listen to, link to, reuse, or otherwise engage with content.
Problematic history of generative AI
Publishers have a right to be angry about the way tech companies (and some researchers) have viewed any and all content they can scrape, scan, or download as fair game for training AI. That behavior is indefensible. Hopefully, as lawsuits make their way through the courts, publishers will be compensated for historical injustices and protected from future theft.
Because of that problematic history, publishers are — understandably — defensive about any use of their content for AI. But publishers need to understand there are many different AI use cases.
Publishers need to make a distinction between people greedily gobbling up content for training models and people selectively accessing content for RAG solutions.
Unlike scraping content for AI training, the use of content in RAG solutions presents an incredible, lucrative opportunity for publishers.
The evolution of content 1·2·3
Remember when having an easy-to-remember URL was incredibly important because there were no search engines? Needs shift.
1. The simple past
Consider the content landscape before ~2022.
The consumer perspective:
- Expected publishers to make web content available for web browsers (and popular e-readers, music players, and streaming apps)
- Expected free access to content, but accepted that some content would include ads or affiliate links, or be behind paywalls
- Wanted relevant content to be easy to find
- Saw value in the published asset (e.g. the article or video) and might save or share links to high-value assets
- Manually viewed, read, or listened to content, usually from beginning to end, often consuming most or all of an article, video, or song
The publisher perspective:
- Wanted consumers to stay on their site
- Viewed features like search, audio interfaces, pleasing visual layout, and related links as a way to attract and retain consumers on their site
2. Current state
Here’s what’s changed, now that RAG is emerging as a killer application for large language models.
First, consider the intrepid RAG solution builders:
- Unfortunately still often view any and all content as fair game
- Programmatically access the same endpoints set up for web browsers
- Must scrape hard-to-access content and convert it to text
- Repeatedly crawl and scrape entire sites to index content in the search component of their RAG solution to keep their knowledge base current
- Struggle to effectively use long-form content (originally intended to be consumed from beginning to end in its entirety) for short answers
Now consider the beleaguered publishers:
- Their content is absorbed into RAG solutions with no attribution
- Their servers get bogged down by bursts of heavy traffic
- They make no money from the RAG solutions using their content
- They lose visitors, because RAG solutions give consumers the information they need without a visit to the published site
- They waste time trying to prevent their content from being scraped
Consumers are being set up to fail too:
- Are quickly becoming used to asking natural language questions (instead of running keyword searches) and expecting natural language answers
- Are often shown LLM-generated output without a paper trail of links to sources, which makes it no better than an ephemeral rumor
3. Potential future state
What if the conflict between RAG solution builders and publishers went away? What if publishers benefited from all those RAG solutions?
One answer is for publishers to provide paid search, content, and question-answering APIs.
Publishers win:
- RAG solutions become a source of revenue for publishers
- Can set beneficial terms of service for RAG solutions (e.g. requiring RAG solutions to link to source articles)
- Can build features into their APIs that enable service improvements or additional monetization
- Content that performs well will be used in more RAG solutions
RAG solution-builders win too:
- Can rely on getting relevant search results and up-to-date content without having to crawl, scrape, convert, and index content for search
- Content will become optimized for RAG
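To make the idea concrete, here is a rough sketch of what such a search API could look like from both sides. Everything in it is an assumption made for illustration: the class PublisherSearchAPI, the Passage record, the API-key usage counter (a stand-in for billing), and the keyword-overlap ranking are all hypothetical, and a real service would sit behind HTTP with a proper search engine.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source_url: str  # lets the RAG solution link back to the source article


def _words(text):
    """Lowercased word set, used for naive keyword matching."""
    return set(re.findall(r"\w+", text.lower()))


class PublisherSearchAPI:
    """Hypothetical in-memory stand-in for a paid publisher search API."""

    def __init__(self, passages, valid_api_keys):
        self._passages = list(passages)
        # Queries per API key: the raw data a publisher would bill from.
        self._usage = {key: 0 for key in valid_api_keys}

    def search(self, api_key, query, top_k=2):
        if api_key not in self._usage:
            raise PermissionError("unknown API key")
        self._usage[api_key] += 1
        query_words = _words(query)
        scored = []
        for passage in self._passages:
            overlap = len(query_words & _words(passage.text))
            if overlap:
                scored.append((overlap, passage))
        # Highest keyword overlap first; ties keep their original order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [passage for _, passage in scored[:top_k]]

    def usage(self, api_key):
        return self._usage[api_key]


def build_grounded_prompt(passages, question):
    """Number each passage and include its URL so answers can cite sources."""
    sources = "\n".join(
        f"[{i}] {p.text} (source: {p.source_url})"
        for i, p in enumerate(passages, start=1)
    )
    return (
        f"Sources:\n{sources}\n"
        "Answer the question using only the sources, and cite them by number.\n"
        f"Question: {question}\n"
        "Answer:"
    )


api = PublisherSearchAPI(
    passages=[
        Passage("Click Save to keep your changes.", "https://example.com/docs/saving"),
        Passage("Use Export to download a copy.", "https://example.com/docs/export"),
    ],
    valid_api_keys={"demo-key"},
)
question = "How do I save my changes?"
results = api.search("demo-key", question)
print(build_grounded_prompt(results, question))
```

In this sketch the RAG solution builder never crawls or indexes anything, the publisher has a per-key usage count to charge against, and every passage arrives with the link a terms-of-service clause could require the builder to show.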
Consumers win big time:
- Entirely new content paradigms will emerge, bringing trustworthy, verifiable information from diverse sources right to you
There’s also a sustainability argument to be made:
- RAG solution builders often use AI models and vector databases to embed (convert text to vectors) and store content for search. This search implementation is particularly energy-intensive, so it’s all the more wasteful for different RAG solution builders to store the same information in many duplicate vector databases.
We are in the messy present, where people are trying to use content in new ways, but publishing infrastructure isn’t yet there to support it.
Conclusion
Publishers, your content is more valuable than ever. Instead of being angry that people are trying to use your content in their RAG solutions, make it easier for them to do that and charge them a fair price for it.