Build a simple RAG pipeline: a conceptual walkthrough

Retrieval-augmented generation, built up one stage at a time. No magic, no specific stack — just the shape of the pipeline and the decisions that matter.

tutorials · 2026-04-25 19:17 KST · Lead Editor · 7 min read

Retrieval-augmented generation (RAG) sounds like a single technique. It is really a pipeline: a sequence of small, ordinary stages that together let a language model answer questions using your own documents instead of only what it memorized during training. None of the stages is clever on its own. The skill is in connecting them so that, at the moment the model writes its answer, the right passage of your text is sitting in front of it. This walkthrough builds that pipeline up one stage at a time and points out where each one tends to break.

Why RAG exists at all

A language model knows what it was trained on. It does not know your company's handbook, last week's incident report, or the contents of a PDF you uploaded a minute ago. You could fine-tune the model on that material, but that is expensive, slow to update, and easy to get wrong. RAG takes the cheaper path: leave the model alone and, for each question, find the relevant passages and paste them into the prompt. The model then answers from text it can see rather than from memory it may not have.

The mental model is a competent assistant with an open book. The assistant is smart but ignorant of your specifics; the book has the specifics but cannot reason. RAG hands the assistant the right page at the right time. Everything below is in service of "the right page at the right time."

Stage 1: Chunking your documents

You cannot hand the model an entire library, so you split your documents into smaller pieces called chunks. A chunk is just a passage — a few paragraphs, a section, a page. Chunking matters more than it looks. Chunks that are too large dilute the relevant sentence with surrounding noise and waste room in the prompt. Chunks that are too small lose the context that makes a sentence meaningful — a line that says "this is not supported" is useless without the paragraph saying what "this" is.

A reasonable default is to chunk along the document's natural structure: by section, heading, or paragraph, rather than by a blind character count. Keep related ideas together. Many pipelines also let chunks overlap slightly, so a sentence near a boundary still appears whole in at least one chunk. There is no universal right size; it depends on your documents, and it is worth revisiting once you can measure results.
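A minimal sketch of structure-aware chunking with overlap, assuming paragraphs are separated by blank lines. The function name `chunk_paragraphs` and the `max_chars` and `overlap` parameters are illustrative choices, not a standard API, and `max_chars` is a soft target: carrying paragraphs over can push a chunk past it.

```python
def chunk_paragraphs(text: str, max_chars: int = 800, overlap: int = 1) -> list[str]:
    """Group whole paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of each chunk are repeated at the start of
    the next one, so text near a boundary appears whole in at least one chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry its tail into the next one.
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "alpha\n\nbravo\n\ncharlie"
print(chunk_paragraphs(doc, max_chars=12, overlap=1))
# ['alpha\n\nbravo', 'bravo\n\ncharlie']
```

Note that "bravo" shows up in both chunks. That duplication is the price of overlap, and it is usually worth paying.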

Stage 2: Embeddings and the vector store

To find relevant chunks later, you need a way to compare a question against every chunk by meaning, not just by matching keywords. This is what embeddings provide. An embedding model turns a piece of text into a list of numbers — a vector — positioned so that texts with similar meaning land near each other in that numeric space. "How do I reset my password?" and "steps to recover account access" use almost no shared words but sit close together as vectors.
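"Near each other" is usually measured with cosine similarity. Here is a small sketch of that comparison. The three vectors are hand-made for illustration and are not the output of any real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (assumption: illustration only).
reset_password = [0.9, 0.1, 0.0]   # "How do I reset my password?"
recover_access = [0.8, 0.2, 0.1]   # "steps to recover account access"
lunch_menu = [0.0, 0.1, 0.9]       # an unrelated text

related = cosine_similarity(reset_password, recover_access)
unrelated = cosine_similarity(reset_password, lunch_menu)
print(related > unrelated)  # True
```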

You run every chunk through the embedding model once, storing each chunk's vector alongside the original text. That collection lives in a vector store: a database built to answer "which stored vectors are nearest to this one?" quickly, even across millions of entries. For a small project the vector store can be a simple in-memory structure; at scale it is a dedicated database. The interface is the same either way: put vectors in, ask for nearest neighbors back.
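For a small project, the "simple in-memory structure" can be little more than a list that gets scanned in full. The sketch below assumes cosine similarity as the distance measure; the class name `InMemoryVectorStore` and its two methods are hypothetical, chosen to mirror the "put vectors in, ask for nearest neighbors back" interface.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force store: keeps (vector, text) pairs and scans all of them."""

    def __init__(self) -> None:
        self._entries: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        # Store the vector alongside the original text it came from.
        self._entries.append((vector, text))

    def nearest(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        # Score every entry, then return the k most similar, best first.
        scored = [(_cosine(query, vec), text) for vec, text in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add([1.0, 0.0], "password reset steps")
store.add([0.0, 1.0], "cafeteria opening hours")
store.add([0.9, 0.1], "recovering account access")
print([text for _, text in store.nearest([1.0, 0.0], k=2)])
# ['password reset steps', 'recovering account access']
```

A full scan costs time proportional to the number of stored chunks, which is fine for thousands of entries. Dedicated vector databases exist largely to avoid that scan at larger scales.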

Stage 3: Retrieval at query time

Now the pipeline runs live. A user asks a question. You embed the question with the same model you used for the chunks — this matters, because vectors from different models are not comparable. You hand that question-vector to the vector store and ask for the nearest chunks. The store returns the top handful: the passages whose meaning is closest to the question.
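One way to make the "same model" rule hard to break is to let a single object own the embedding function and use it for both indexing and querying. The sketch below does that. Everything in it is hypothetical: `toy_embed` just counts words from a tiny fixed vocabulary, so it stands in for a real embedding model and does not capture meaning.

```python
import math
import re
from typing import Callable

Vector = list[float]
VOCAB = ["password", "reset", "invoice", "refund"]  # assumption: toy vocabulary


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: counts vocabulary words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class Retriever:
    """Embeds chunks and questions with the same function, by construction."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._index: list[tuple[Vector, str]] = []

    def add(self, chunk: str) -> None:
        self._index.append((self._embed(chunk), chunk))

    def retrieve(self, question: str, k: int = 3) -> list[str]:
        query_vector = self._embed(question)  # same model as at indexing time
        ranked = sorted(
            self._index,
            key=lambda item: _cosine(query_vector, item[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]


retriever = Retriever(toy_embed)
retriever.add("To reset your password, open settings.")
retriever.add("Refund requests need an invoice number.")
print(retriever.retrieve("How do I reset my password?", k=1))
# ['To reset your password, open settings.']
```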

"How many" is a real decision. Return too few and you risk missing the passage that holds the answer. Return too many and you crowd the prompt with marginally related text, which both costs more and distracts the model. A small number — enough to cover the answer, not so many that signal drowns in noise — is the usual starting point. Retrieval is also where pure semantic search sometimes stumbles on exact terms like product codes or names, which is why some pipelines blend it with old-fashioned keyword search. Start simple; add that only if you see the failure.

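If you do see the exact-term failure, one simple way to blend in keyword search is to rescore the retrieved candidates with a weighted mix of the semantic score and a keyword-overlap score. This is a sketch of that idea, not the only approach: the function names and the `alpha` weight are hypothetical, and the semantic scores in the example are made up.

```python
import re


def _terms(text: str) -> set[str]:
    # Keep hyphens so codes like "xk-200" stay in one piece.
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))


def keyword_score(query: str, chunk: str) -> float:
    """Fraction of the query's terms that appear verbatim in the chunk."""
    query_terms = _terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & _terms(chunk)) / len(query_terms)


def rerank(
    query: str, candidates: list[tuple[float, str]], alpha: float = 0.5
) -> list[tuple[float, str]]:
    """Blend each (semantic_score, text) candidate with its keyword score.

    alpha = 1.0 means pure semantic ranking, alpha = 0.0 means pure keyword.
    """
    rescored = [
        (alpha * semantic + (1 - alpha) * keyword_score(query, text), text)
        for semantic, text in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)


candidates = [
    (0.80, "General troubleshooting steps for printers."),  # made-up score
    (0.75, "Error codes for model XK-200 printers."),       # made-up score
]
best_score, best_text = rerank("XK-200 error codes", candidates)[0]
print(best_text)  # Error codes for model XK-200 printers.
```

The chunk that names the product code wins after blending even though its semantic score was lower, which is exactly the case pure semantic search tends to miss.
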
Stage 4: Assembling the prompt

You now have the user's question and a few retrieved chunks. The generation step assembles them into a single prompt. Conceptually it looks like:

You are answering using only the context below.
If the answer is not in the context, say you don't know.

Context:
[chunk 1 text]
[chunk 2 text]
[chunk 3 text]

Question: [the user's question]

Two instructions in there are doing quiet, heavy lifting. "Using only the context below" tells the model to prefer the supplied passages over its own memory, which is the entire point of RAG. "If the answer is not in the context, say you don't know" gives the model permission to decline — without it, a model tends to fill the gap with a confident guess. Naming that failure mode is the difference between an honest "not found" and a fabrication.
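In code, assembling that prompt is plain string formatting. A minimal sketch, assuming the chunks arrive as a list of strings already ordered by relevance; `build_prompt` is a hypothetical helper name, and the wording is the template shown above.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n".join(chunks)
    return (
        "You are answering using only the context below.\n"
        "If the answer is not in the context, say you don't know.\n"
        "\n"
        "Context:\n"
        f"{context}\n"
        "\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Contact support to start a refund."],
)
print(prompt)
```

Keeping this in one small function pays off later: when you want to change the instructions or the chunk order, there is exactly one place to do it.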

Stage 5: Generation and citing sources

The assembled prompt goes to the language model, which writes the answer grounded in the retrieved text. Because you kept each chunk's original source, you can do something fine-tuning cannot: show where the answer came from. Carry an identifier with each chunk — document title, section, page — and ask the model to reference it, or simply display the source passages beneath the answer. Citations turn an opaque response into one a user can verify, and verifiability is often what makes a RAG system trustworthy enough to deploy.
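Carrying an identifier with each chunk can be as simple as a small record type plus two formatters: one that numbers the chunks inside the prompt so the model can refer to them, and one that lists the sources for display under the answer. The names `SourcedChunk`, `format_context`, and `format_sources` are hypothetical, and the bracketed-number convention is just one reasonable choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcedChunk:
    text: str
    title: str    # document title
    section: str  # section or page within the document


def format_context(chunks: list[SourcedChunk]) -> str:
    """Number each chunk and label it with its source, for use in the prompt."""
    return "\n".join(
        f"[{number}] ({chunk.title}, {chunk.section}) {chunk.text}"
        for number, chunk in enumerate(chunks, start=1)
    )


def format_sources(chunks: list[SourcedChunk]) -> str:
    """List the sources by the same numbers, for display beneath the answer."""
    return "\n".join(
        f"[{number}] {chunk.title}, {chunk.section}"
        for number, chunk in enumerate(chunks, start=1)
    )


retrieved = [
    SourcedChunk("Passwords expire every 90 days.", "Employee Handbook", "Security"),
    SourcedChunk("Resets are handled by the help desk.", "IT FAQ", "Accounts"),
]
print(format_context(retrieved))
print(format_sources(retrieved))
```

Because both formatters number the chunks the same way, a "[2]" in the model's answer points at the same passage the user sees in the source list.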

This is also where the pipeline's honesty is tested. If retrieval handed over the wrong passages, the model will answer fluently from the wrong passages. A confident answer is not evidence of a correct one. Which leads directly to the part most first builds skip.

Where RAG pipelines actually break

The failures are rarely in the model. They are upstream, in chunking and retrieval. If the relevant chunk was never returned, the best model in the world cannot use it: "garbage in, fluent garbage out." The usual culprits are chunks split so that the answer straddles a boundary and appears whole in none of them, an embedding model that does not capture your domain's vocabulary, or simply asking for too few chunks. When a RAG system gives a wrong answer, resist blaming the model first. Look at what retrieval actually returned for that question. Often the answer was not in the retrieved set at all, and the fix is in chunking or retrieval, not in the prompt.

The way to catch this is to evaluate the pipeline on real questions whose answers you already know, and to inspect the retrieved chunks, not just the final text. A pipeline that retrieves the right passage and then answers well is working. One that answers well by luck while retrieving the wrong passage is a bug waiting for a harder question.
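A retrieval check like this can be a few lines. The sketch below assumes each test case pairs a question with the identifier of the chunk known to contain its answer, and that your retriever can be called as `retrieve(question, k)` and returns chunk identifiers. Both are assumptions about your setup, and `retrieval_hit_rate` is a hypothetical name.

```python
from typing import Callable


def retrieval_hit_rate(
    cases: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> tuple[float, list[str]]:
    """Return the share of questions whose expected chunk was retrieved,
    plus the list of questions where it was missed."""
    if not cases:
        raise ValueError("need at least one test case")
    misses = [
        question
        for question, expected_chunk_id in cases
        if expected_chunk_id not in retrieve(question, k)
    ]
    hits = len(cases) - len(misses)
    return hits / len(cases), misses


# A canned retriever standing in for the real pipeline (illustration only).
canned = {
    "How do I reset my password?": ["faq-12", "handbook-3"],
    "What is the refund window?": ["faq-40"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return canned[question][:k]


rate, missed = retrieval_hit_rate(
    [
        ("How do I reset my password?", "handbook-3"),
        ("What is the refund window?", "policy-7"),
    ],
    fake_retrieve,
    k=2,
)
print(rate, missed)  # 0.5 ['What is the refund window?']
```

The list of missed questions matters more than the number: each one is a concrete case where you can open the retrieved chunks and see what went wrong.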

The takeaway

RAG is not one trick but a short assembly line: chunk your documents sensibly, embed them into a vector store, retrieve the nearest chunks for each question, assemble them into a grounded prompt, and generate an answer that cites its sources. The model is the easy part. The quality of the whole system is decided by chunking and retrieval — getting the right page in front of the model at the right moment. Build each stage plainly, then measure retrieval on real questions, and you will spend your effort where the failures actually live rather than where they look most impressive.

#rag #retrieval #embeddings #tutorial
