Retrieval-augmented generation (RAG), from first principles
RAG is often explained as a stack of tools. Strip that away and it is one simple idea: let the model read the right material before it answers. Here is how it really works.
Retrieval-augmented generation is usually introduced as a pipeline of products: an embedding model, a vector database, a retriever, a generator. That framing is backwards. RAG is one idea, and the tools are just one way to implement it. The idea: before the model answers, give it the specific material it needs to answer from. Everything else is engineering.
The term comes from a 2020 paper by Lewis and colleagues, which combined a retriever with a generator so that a model could pull in external passages rather than relying solely on what its weights had memorized. The motivation then is the motivation now, and understanding it from first principles will outlast any particular tool.
The problem RAG solves
A language model knows only what was in its training data, frozen at a point in time, blended into its weights. That creates two limits:
- It cannot know your private or recent information. Your company's documents, last week's release notes, the contents of a specific PDF — none of it is in the model.
- Recall from weights is lossy. Even for things "in" the model, exact details can come out wrong. The model is reconstructing from a compressed memory, not looking anything up.
RAG addresses both by changing the question from "what does the model remember?" to "what can we put in front of the model right now?" That shift — from memory to evidence — is the whole point.
The mechanism in three steps
At its core, RAG is three moves:
- Index. Break your source material into passages and store them so they can be searched. Most systems do this by converting each passage into an embedding — a vector that places similar meanings near each other — and storing those vectors.
- Retrieve. When a question comes in, find the passages most relevant to it. With embeddings, that means turning the question into a vector and fetching the nearest passages.
- Generate. Put the retrieved passages into the model's context along with the question, and ask it to answer using that material.
That is the whole arc. The model still writes the answer, but now it is reading from supplied evidence instead of recalling from memory. Everything sophisticated in RAG is a refinement of one of these three steps.
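The three moves above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model (it matches shared words, not meaning), the function names are hypothetical, and the call to the language model is left out, so the "generate" step stops at the prompt the model would receive.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(passages):
    # Index: store each passage next to its vector.
    return [(p, embed(p)) for p in passages]


def retrieve(index, question, k=2):
    # Retrieve: rank passages by similarity to the question, keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]


def build_prompt(question, passages):
    # Generate: assemble what the model would read (model call omitted).
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the material below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


passages = [
    "Employees receive 20 days of paid vacation per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
]
question = "How many vacation days do employees receive?"

index = build_index(passages)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping `embed` for a real embedding model and sending `prompt` to a real model changes the quality of each step, not the shape of the loop.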
Why embeddings, and why they are not magic
Embeddings let you search by meaning rather than exact words, so a question about "time off" can retrieve a passage about "vacation policy" even with no shared keywords. That is genuinely useful, and it is why semantic search underpins most RAG systems. But two honest caveats:
- Semantic search is not exact search. For precise identifiers — a product code, a specific clause number, an error string — keyword search often beats embeddings. Many strong systems combine both, a pattern usually called hybrid search.
- Retrieval quality caps everything downstream. If the retrieve step returns the wrong passages, the model answers from the wrong material and sounds just as confident. This is the single most important fact about RAG, and it is the one demos hide.
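One common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion, which only needs each search's ordered list of results. The sketch below assumes both searches have already run; the document IDs are made up for illustration, and the constant `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by more than one search rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two separate searches over the same corpus.
keyword_ranking = ["doc_code", "doc_faq", "doc_policy"]
semantic_ranking = ["doc_policy", "doc_code", "doc_guide"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

Because the fusion uses ranks rather than raw scores, it avoids having to put keyword scores and vector similarities on a shared scale.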
Chunking: the unglamorous decision that decides quality
How you split your documents into passages quietly determines how well retrieval works. Chunks that are too long dilute relevance — the useful sentence is buried among unrelated ones, and the embedding becomes an average of too many ideas. Chunks that are too short lose the context that makes them meaningful. The durable advice is to chunk along natural boundaries — sections, paragraphs, logical units — rather than slicing at arbitrary character counts. Good chunking is boring work, and it pays off more than swapping in a fancier model.
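As a rough illustration of chunking along natural boundaries, the sketch below splits on blank lines and packs whole paragraphs into chunks up to a size limit. The function name and the `max_chars` limit are assumptions for the example; a real system might split on headings or sentences instead, and would need its own rule for a single paragraph that exceeds the limit (here it is simply kept whole).

```python
def chunk_by_paragraph(text, max_chars=500):
    """Group whole paragraphs into chunks of at most max_chars characters.

    Paragraphs are never cut in the middle. A paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Vacation policy.\n\n"
    "Employees receive 20 days.\n\n"
    "Expense policy.\n\n"
    "Reports are due monthly."
)
chunks = chunk_by_paragraph(doc, max_chars=50)
for chunk in chunks:
    print(repr(chunk))
```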
What good RAG actually requires
The naive version — embed everything, fetch the top few, stuff them in — works in a demo and disappoints in production. The parts that make it real:
- Sensible chunking, as above.
- Enough, but not too much, context. Retrieving more passages is not always better. Irrelevant passages distract the model and compete with the useful ones for its attention. There is a sweet spot, and it is usually smaller than people expect.
- Grounding instructions. Tell the model to answer only from the provided material and to say clearly when the material does not contain the answer. This is what turns retrieval into trustworthy answers instead of confident guesses.
- Showing sources. Returning which passages were used lets a human verify the answer — essential for anything high-stakes, and a quiet trust-builder everywhere else.
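Grounding instructions and source display can share one mechanism: number the passages in the prompt and ask the model to cite those numbers. The sketch below shows one possible wording; the function name and the exact instruction text are assumptions, and the right phrasing will vary by model.

```python
def build_grounded_prompt(question, passages):
    """Build a prompt that restricts the model to numbered sources."""
    sources = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the source numbers you used, like [1]. "
        "If the sources do not contain the answer, say so instead of guessing."
        "\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


retrieved = [
    "Employees receive 20 days of paid vacation per year.",
    "Unused vacation days expire at the end of March.",
]
grounded_prompt = build_grounded_prompt(
    "How many vacation days do I get?", retrieved
)
print(grounded_prompt)
```

Keeping the numbered list alongside the answer means any `[n]` the model cites can be mapped back to the passage a reader should check.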
How to tell if your RAG is any good
Because most failures are retrieval failures, evaluate retrieval separately from generation. Two questions, asked on real examples:
- Did the right passage get retrieved at all? If the answer is not in the retrieved set, no amount of clever prompting will save the generation step.
- Given the right passage, did the model use it faithfully? If retrieval was good but the answer drifted, the problem is grounding, not search.
Splitting the evaluation this way tells you which half to fix. Lumping them together tells you only that "it's wrong," which is not actionable.
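The first of those two questions is easy to measure once you have a small set of questions labeled with the passage that should be retrieved. The sketch below computes a hit rate over such a set; the example data, the passage IDs, and `fake_retrieve` are all hypothetical placeholders for your own labeled examples and retriever. The second question, faithfulness, usually needs human or model-assisted review and is not covered here.

```python
def retrieval_hit_rate(examples, retrieve, k=3):
    """Fraction of examples whose gold passage appears in the top k results.

    `retrieve` is any function that maps a question to a ranked list of
    passage IDs.
    """
    if not examples:
        return 0.0
    hits = 0
    for example in examples:
        top_k = retrieve(example["question"])[:k]
        if example["gold_passage_id"] in top_k:
            hits += 1
    return hits / len(examples)


# Hypothetical labeled examples: each question names the passage that
# should come back.
examples = [
    {"question": "How many vacation days?", "gold_passage_id": "p1"},
    {"question": "When are expenses due?", "gold_passage_id": "p3"},
]

# Canned results standing in for a real retriever.
canned_results = {
    "How many vacation days?": ["p1", "p2"],
    "When are expenses due?": ["p2", "p4"],
}


def fake_retrieve(question):
    return canned_results[question]


rate = retrieval_hit_rate(examples, fake_retrieve, k=2)
print(f"retrieval hit rate: {rate:.2f}")
```

A low hit rate points at indexing, chunking, or search; a high hit rate with wrong answers points at the grounding instructions.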
What RAG does not fix
RAG grounds answers in supplied text. It does not make a model reason better, and it does not guarantee truth: if your source documents are wrong or out of date, the answer will be wrong too, delivered in the same confident voice. It also adds moving parts: an indexing step and a retrieval step, each with its own failure modes to monitor. RAG is the right tool when answers must reflect specific, changing, or private information. It is overkill when the model already knows enough and you just need a better prompt.
Where RAG is heading
The frontier of RAG is mostly about making retrieval smarter: deciding when to retrieve, retrieving in multiple passes, letting the model issue its own search queries, and re-ranking results before they reach the generator. These add capability and complexity in equal measure. The first-principles view still holds underneath all of it — they are all ways of getting better material in front of the model before it answers.
The takeaway
Forget the product stack for a moment. RAG is the discipline of letting the model read the right material before it answers. Embeddings and vector stores are a popular implementation, not the essence. Get the retrieval right, chunk with care, and instruct the model to stay grounded, and RAG turns a model that guesses from memory into one that answers from evidence — with sources you can check.
