Chunk documents well for retrieval

Retrieval is only as good as its chunks. Here is how to split documents so the right passage comes back whole and in context.

Tutorials · 2026-04-29 19:38 KST · Lead Editor · 7 min read

When a retrieval system fails, the cause is often not the model or the search algorithm but the chunks. A chunk is the unit you index and retrieve: the piece of a document that gets matched against a query and handed to the model as context. If your chunks are badly cut, even perfect retrieval returns fragments that are too small to be useful, too big to be relevant, or split right through the middle of the answer. Getting chunking right is unglamorous, but it determines how well everything downstream works.

Why chunk size decides everything downstream

You usually can't feed an entire document to a model for every query. It's wasteful, and burying the relevant passage in pages of irrelevant text makes the answer worse, not better. So you split documents into chunks, embed each one, and retrieve only the chunks that match the query. The chunk is therefore both the unit of search and the unit of context. Its size and boundaries decide what the model gets to see.

This dual role creates tension. A chunk small enough to be a precise search match may be too small to carry enough context for the model to answer. A chunk large enough to be self-contained may be too broad to match a specific query cleanly, because it mixes several topics and dilutes the signal. Good chunking is the craft of resolving this tension — making chunks that are specific enough to retrieve accurately and complete enough to answer from.

Respect the structure of the document

The worst way to chunk is to cut every N characters regardless of content. That guarantees boundaries that fall mid-sentence, mid-table, mid-thought — splitting an answer across two chunks so that neither one contains it whole. The better default is to cut along the document's own structure: paragraphs, sections, headings, list items. These boundaries exist because the author already grouped related ideas, and a chunk that respects them tends to be a coherent unit of meaning.

Different documents have different natural seams. Prose breaks cleanly at paragraphs and sections. Code breaks at functions or logical blocks. A FAQ breaks at each question-and-answer pair. Tables and lists want to stay intact, because half a table is usually useless. The principle is the same across all of them: let the document tell you where the joints are, and cut at the joints rather than at arbitrary offsets. Structure-aware chunks are coherent chunks.
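A minimal sketch of the difference, in Python. The function names, the sample text, and the 40-character width are illustrative assumptions, and blank lines stand in for whatever seams your format actually provides (headings, function boundaries, question-and-answer pairs).

```python
import re


def split_fixed(text: str, n: int) -> list[str]:
    """Naive baseline: cut every n characters, ignoring content."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def split_paragraphs(text: str) -> list[str]:
    """Structure-aware: cut at blank lines, which mark paragraph seams in plain text."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


# Hypothetical sample document with two paragraphs.
doc = (
    "Refunds are issued within 14 days.\n"
    "Contact support to start one.\n\n"
    "Shipping takes 3 to 5 business days."
)

fixed_chunks = split_fixed(doc, 40)   # first cut lands inside the word "Contact"
para_chunks = split_paragraphs(doc)   # one chunk per paragraph
```

The paragraph splitter keeps the two refund sentences together, while the fixed-width cut separates them partway through a word.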

Size chunks to the question, not a fixed number

There is no universal correct chunk size, because the right size depends on the kind of question you expect. If users ask narrow, fact-level questions, smaller chunks retrieve more precisely — the matching passage isn't diluted by surrounding material. If users ask broad questions that require synthesizing a whole section, chunks need to be large enough to carry that breadth, or the model gets a fragment and misses the point.

The practical move is to think about what a complete answer needs and size chunks to hold it. A chunk should ideally contain enough surrounding context that, read on its own, it makes sense — a passage that depends entirely on the sentence before it, which lives in a different chunk, will retrieve poorly and answer worse. Aim for chunks that are self-sufficient units of meaning for the questions you actually get, and let that guide size rather than picking a round number and hoping.
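One way to treat size as a budget rather than a cut point is to pack whole structural units into chunks until the budget is reached. The sketch below assumes the units are paragraphs that were already split out, and `pack_units` and `max_chars` are hypothetical names chosen for illustration.

```python
def pack_units(units: list[str], max_chars: int) -> list[str]:
    """Greedily merge consecutive units (e.g. paragraphs) into chunks of at most max_chars.

    A single unit longer than max_chars is kept whole rather than cut mid-thought.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        extra = len(unit) + (2 if current else 0)  # 2 accounts for the "\n\n" joiner
        if current and current_len + extra > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            extra = len(unit)
        current.append(unit)
        current_len += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Placeholder paragraphs of 30, 30, 30, and 100 characters.
paragraphs = ["a" * 30, "b" * 30, "c" * 30, "d" * 100]
narrow = pack_units(paragraphs, max_chars=40)    # one paragraph per chunk
broad = pack_units(paragraphs, max_chars=100)    # first three paragraphs share a chunk
```

A small budget suits narrow, fact-level questions. A larger one keeps a section together for broad questions. In both cases the cut still falls on a paragraph boundary.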

Use overlap to avoid cutting the answer in half

Even with structure-aware boundaries, an answer sometimes straddles a boundary: the setup sits at the end of one chunk and the conclusion at the start of the next. Overlap is the standard remedy. Let consecutive chunks share a bit of text at their edges, so a passage near a boundary appears, intact, in at least one chunk. A small overlap is cheap insurance against splitting an answer down the middle.

Don't overdo it. Heavy overlap inflates your index, returns near-duplicate chunks for the same query, and wastes context budget on repetition. The goal is just enough shared text to keep boundary-spanning answers whole — a modest tail carried into the next chunk, not a large redundant window. Like chunk size, the right amount depends on your content; the principle is to use the smallest overlap that stops answers from being cut in half.
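As a rough illustration, overlap can be expressed as a sliding window over sentences. The window of three sentences with one shared sentence is an assumed setting for the example, not a recommendation, and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(units: list[str], size: int, overlap: int) -> list[list[str]]:
    """Group units into windows of `size`, each sharing `overlap` units with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    windows: list[list[str]] = []
    for start in range(0, len(units), step):
        windows.append(units[start:start + size])
        if start + size >= len(units):
            break  # this window already reaches the end of the document
    return windows


sentences = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
windows = chunk_with_overlap(sentences, size=3, overlap=1)
# windows == [["S1", "S2", "S3"], ["S3", "S4", "S5"], ["S5", "S6", "S7"]]
```

The cost is easy to estimate. For long documents, the amount of indexed text grows by a factor approaching size / (size - overlap), which is 1.5 for the settings above, so a larger overlap quickly becomes expensive.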

Keep the context a chunk needs to make sense

A chunk ripped out of its document loses information that the document provided implicitly: which section it came from, what the document is about, what "it" refers to three paragraphs up. When that chunk is retrieved in isolation, the model sees the text but not the frame, and the answer suffers. Carrying a little of that frame along with each chunk — the document title, the section heading, a short note on what the chunk concerns — restores context that the cut threw away.

This matters most for passages that are ambiguous on their own. A chunk that says "the limit is raised to twice the previous value" is nearly useless without knowing what limit and what previous value; a chunk that carries its section heading gives the model and the search index something to anchor on. The fix is small (a header line or a sentence of context attached to each chunk), and it consistently improves both what gets retrieved and what the model does with it.
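A sketch of one possible way to carry the frame along: store the title and heading next to the body, and prepend them as a single header line when building the text to index. The `Chunk` class, the bracketed header format, and the sample title and heading are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    section: str
    body: str

    def text_for_index(self) -> str:
        """Text to embed and to hand to the model: a one-line frame, then the body."""
        return f"[{self.doc_title} > {self.section}]\n{self.body}"


chunk = Chunk(
    doc_title="API Rate Limits",       # hypothetical document title
    section="Burst limit changes",     # hypothetical section heading
    body="The limit is raised to twice the previous value.",
)
indexed_text = chunk.text_for_index()
```

With the header attached, a query about burst limits has words to match against, and the model can tell which limit the sentence refers to.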

Evaluate chunking against real queries

You cannot tell whether your chunking is good by looking at the chunks. You can only tell by running real queries and checking whether the right passage comes back whole and usable. Collect a set of representative questions for which you know the correct source passage, run retrieval, and inspect what comes back: Did the relevant chunk surface? Was it complete, or split? Did irrelevant chunks crowd it out?

Treat chunking as a parameter you tune against that set, not a decision you make once. Change the boundary strategy, the target size, or the overlap, then run the same queries and compare which setup retrieves the right, complete passage more often. The best chunking strategy for your corpus is an empirical question, and the answer differs by document type and query style. The teams whose retrieval works are the ones who measured, not the ones who guessed a chunk size and moved on.
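The comparison can be sketched as a small harness. The word-overlap retriever below is a toy stand-in for whatever search you actually run, and the labeled queries and both candidate chunkings are made-up examples. The only part meant to carry over is the shape of the loop: same queries, different chunkings, one score each.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Toy retriever (an assumption): rank chunks by words shared with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def hit_rate(labeled: list[tuple[str, str]], chunks: list[str], k: int) -> float:
    """Fraction of queries whose expected passage appears whole inside a top-k chunk."""
    if not labeled:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, expected in labeled:
        if any(expected in c for c in retrieve(query, chunks, k)):
            hits += 1
    return hits / len(labeled)


# Each entry pairs a query with the source passage known to answer it.
labeled = [
    ("how long do refunds take", "Refunds are issued within 14 days."),
    ("how long does shipping take", "Shipping takes 3 to 5 business days."),
]
chunking_a = ["Refunds are issued within 14 days.", "Shipping takes 3 to 5 business days."]
chunking_b = ["Refunds are issued with", "in 14 days. Shipping takes 3 to", " 5 business days."]

score_a = hit_rate(labeled, chunking_a, k=1)  # passages come back whole
score_b = hit_rate(labeled, chunking_b, k=1)  # every passage is split across chunks
```

Requiring the expected passage to appear whole inside a retrieved chunk is a deliberately strict check, since it penalizes exactly the split-answer failure that chunking is supposed to prevent.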

The takeaway

Retrieval lives or dies on its chunks. Cut along the document's own structure rather than at arbitrary offsets, size chunks to hold a complete answer for the questions you actually get, and use a modest overlap so answers near a boundary stay whole. Carry a little context — the title, the heading — along with each chunk so it makes sense in isolation. Then prove the whole thing against real queries with known answers, and tune the boundaries, size, and overlap empirically. Good chunking is invisible when it works and the root cause when it doesn't; spend your effort here and the rest of the pipeline gets easier.

