The paper that introduced RAG, explained simply

If you've heard that an AI tool can "answer questions about your documents," you've heard about RAG, or retrieval-augmented generation. The term comes from a 2020 paper by researchers at Facebook AI Research, University College London, and New York University, *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*. Here's the idea without the jargon.

(Our plain-language summary; the paper is linked so you can read the original.)

The problem it set out to solve

A language model only knows what it absorbed during training. That knowledge is frozen at a point in time and, more importantly, it never included your private material. Ask a stock model about your contracts, your product documentation, or yesterday's support tickets and one of two things happens: it admits it doesn't know, or — far more dangerously — it confidently invents a plausible-sounding answer. The second failure is what people mean by "hallucination," and in 2020 it was the main thing standing between impressive demos and trustworthy products.

The obvious fix at the time was to bake the knowledge into the model itself by fine-tuning it on your data. But that is slow and expensive, and the result goes stale the moment a document changes. The paper proposed a cleaner separation.

The core idea: an open-book exam

RAG splits the job in two, much like the difference between a closed-book and an open-book exam:

1. Retrieve. When you ask a question, the system first searches a library of your documents and pulls back the handful of passages most relevant to the question.
2. Generate. It hands those passages to the language model along with your question and asks it to answer using that text, not just its frozen memory.
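To make the two steps concrete, here is a minimal sketch in Python. It is not the paper's method: the retriever below simply counts shared words, and the names `LIBRARY`, `retrieve`, and `build_prompt` are hypothetical, as is the sample text. The point is only the shape of the pipeline: search first, then hand the results to the model along with the question.

```python
import re

# Hypothetical mini-library; in practice these would be chunks of your documents.
LIBRARY = [
    "Limited Warranty: hardware is covered for 24 months from the purchase date.",
    "Exclusions: accidental damage and water damage are not covered.",
    "Shipping: orders leave the warehouse within 2 business days.",
]


def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, library, k=2):
    """Step 1 (Retrieve): return the k passages sharing the most words with the question.

    Word overlap is a stand-in for a real search engine or embedding index.
    """
    question_words = tokenize(question)
    ranked = sorted(
        library,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Step 2 (Generate): assemble the text that would be sent to the language model."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below and cite the source number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


question = "Is the hardware still covered by warranty after 18 months?"
passages = retrieve(question, LIBRARY)
prompt = build_prompt(question, passages)
print(prompt)
```

Sending `prompt` to an actual model is left out on purpose; any chat or completion API would take its place at that point.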

The shift in role is the whole point. Instead of a know-it-all answering from memory and hoping it's right, the model becomes a skilled writer working from sources you handed it. Its job changes from "recall the fact" to "read these passages and compose a grounded answer," which is something language models are genuinely good at. It's the difference between asking a clever colleague to answer off the top of their head and asking them to answer after you've slid the relevant file across the desk — same person, far more reliable result.

A concrete scenario

Imagine an internal assistant for your support team. A customer asks whether a product is covered under warranty after 18 months. With RAG, the system first searches your warranty policy documents, retrieves the two paragraphs about coverage periods and exclusions, and feeds those to the model. The answer comes back as "Yes, this is covered for 24 months, see the Limited Warranty section" — and you can show the user exactly which passage it leaned on. Update the warranty PDF next week, and the very next answer reflects the new terms with no retraining at all.

Behind the scenes, the retrieval step usually works by turning both your documents and the incoming question into lists of numbers — "embeddings" — that capture meaning rather than exact wording. The system then finds the document chunks whose meaning sits closest to the question, which is why RAG can match "what's the cover period?" to a passage that never uses the word "cover." That semantic matching is the quiet engine that makes the open-book approach feel intelligent rather than like a keyword search.
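A rough sketch of that matching step, in Python. The three-number vectors below are made up by hand purely for illustration; a real system would get vectors with hundreds of dimensions from an embedding model. The closeness measure shown, cosine similarity, is a common choice but an assumption here, since the article does not name one.

```python
import math


def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


# Hand-made toy embeddings (assumption): each chunk maps to a short list of numbers.
chunks = {
    "Hardware is protected under warranty for 24 months.": [0.9, 0.1, 0.0],
    "Orders ship within 2 business days.": [0.1, 0.9, 0.1],
    "Our office is closed on public holidays.": [0.0, 0.2, 0.9],
}

# Pretend embedding of the question "what's the cover period?"
question_vec = [0.8, 0.2, 0.1]

# Pick the chunk whose vector sits closest to the question's vector.
best = max(chunks, key=lambda text: cosine(question_vec, chunks[text]))
print(best)
```

The winning chunk never contains the word "cover"; it wins because its numbers sit closest to the question's numbers, which is the behavior the paragraph above describes.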

Why it became the default

No retraining. You can point RAG at your latest documents without the expensive, slow process of fine-tuning a model on every change.
Fewer made-up answers. Grounding responses in retrieved text reduces hallucination — and lets you show citations so a human can verify the source.
Easy to update and govern. Change a document and, once it has been re-indexed, the next answer reflects it; remove a document from the index and the model can no longer cite it. That makes permissions and freshness far easier to reason about.
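The update-and-remove behavior can be sketched with a tiny in-memory index. Everything here is hypothetical: the `DocumentIndex` class, its method names, and the sample texts are illustrations, and word overlap again stands in for real search. What it shows is that retrieval only ever sees whatever is currently in the index.

```python
import re


def words(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocumentIndex:
    """Hypothetical in-memory store mapping a document id to its current text."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc_id, text):
        """Add a document, or replace the old version if the id already exists."""
        self._docs[doc_id] = text

    def remove(self, doc_id):
        """Delete a document so it can no longer be retrieved or cited."""
        self._docs.pop(doc_id, None)

    def search(self, question, k=1):
        """Return up to k (doc_id, text) pairs that share at least one word with the question."""
        question_words = words(question)
        scored = []
        for doc_id, text in self._docs.items():
            overlap = len(question_words & words(text))
            if overlap:
                scored.append((overlap, doc_id, text))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in scored[:k]]


index = DocumentIndex()
index.upsert("warranty", "Warranty coverage lasts 24 months.")
index.upsert("shipping", "Orders ship within 2 business days.")

question = "How long does warranty coverage last?"
before = index.search(question)

index.upsert("warranty", "Warranty coverage lasts 36 months.")  # the policy changed
after_update = index.search(question)

index.remove("warranty")  # the document was withdrawn
after_removal = index.search(question)
```

No model was retrained at any point; only the index changed, and the search results changed with it.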

RAG is why "chat with your data" went from research demo to standard product feature in just a few years.

The catch

RAG is only as good as its retrieval step. If the search pulls the wrong passages — or misses the right one entirely — the model will answer fluently and confidently from the wrong source, and it has no way of knowing it was handed bad material. In our experience, the large majority of "our RAG isn't working" complaints are really search problems in disguise: poor chunking of documents, weak embeddings, missing metadata, or queries that don't match how the source text is phrased. Fixing retrieval quality — not swapping in a bigger model — is usually where the real wins are.
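One way to check whether a complaint is really a search problem is to measure retrieval on its own, before the model is involved. The sketch below computes a simple hit rate: the share of test questions for which the expected chunk shows up in the top k results. The function name, the chunk ids, and the canned rankings are all hypothetical; `fake_search` stands in for whatever retriever you actually run.

```python
def hit_rate_at_k(test_cases, search, k=3):
    """Fraction of questions whose expected chunk id appears in the top k results."""
    hits = 0
    for question, expected_id in test_cases:
        if expected_id in search(question)[:k]:
            hits += 1
    return hits / len(test_cases)


# Canned rankings standing in for a real retriever (assumption, for illustration only).
CANNED_RESULTS = {
    "what's the cover period?": ["warranty-2", "shipping-1", "returns-4"],
    "can I return an opened item?": ["shipping-1", "warranty-2", "faq-9"],
}


def fake_search(question):
    """Return the canned ranked list of chunk ids for a question."""
    return CANNED_RESULTS[question]


# Each test case pairs a question with the chunk a human says should be retrieved.
test_cases = [
    ("what's the cover period?", "warranty-2"),
    ("can I return an opened item?", "returns-4"),
]

score = hit_rate_at_k(test_cases, fake_search, k=3)
print(score)  # 0.5: the second question never surfaces the returns chunk
```

A low score here points at chunking, embeddings, or metadata rather than at the language model, since the model never saw the right passage in the first place.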

When to reach for something else

RAG shines when knowledge changes often and answers must be grounded and citable. It's a weaker fit when you need the model to adopt a consistent style, follow a complex internal procedure, or perform a narrow task very reliably — those are jobs where fine-tuning can earn its keep. We lay out the trade-offs in RAG vs fine-tuning, and the two are often combined rather than chosen between. If you're weighing which fits your data, we're happy to help you scope it. Looking forward, retrieval is steadily merging into agentic systems that decide when and what to look up — but the open-book principle at the heart of this paper stays the same.

Sources

Lewis et al. (2020) — *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*

Written by Zain Ali
