Context windows explained: tokens, attention, and where long context breaks

A bigger context window is not the same as better memory. Here is what a context window really is, why long inputs degrade, and how to design around it.

models · 2026-06-02 10:06 KST · Lead Editor · 7 min read

The context window is one of the most quoted and least understood numbers in modern AI. A larger one sounds strictly better, the way more RAM or more storage sounds better. It is not that simple. A context window is a working space with real limits and a quiet failure mode, and treating it like infinite, perfect memory is how teams ship systems that mysteriously forget what they were told three paragraphs earlier.

This piece explains what a context window is in concrete terms, why the underlying mechanism makes long inputs expensive and imperfect, and how to design a system that respects those limits instead of pretending they do not exist.

What a context window actually is

A language model does not read characters or words directly. It reads tokens: chunks of text, often a word, part of a word, or a punctuation mark. A commonly quoted rule of thumb is that a token corresponds to about four characters of English, so a page of text runs to several hundred tokens. The exact mapping depends on the model's tokenizer and on the language, but the principle is stable: text becomes a sequence of tokens before the model sees anything.
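
To make the rule of thumb concrete, here is a minimal sketch of a character-based estimate. The four-characters-per-token ratio and the name `estimate_tokens` are assumptions for illustration; an exact count requires the tokenizer of the specific model you are calling.

```python
import math

# Assumption: roughly 4 characters per token for English text.
# Real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based only on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# About 500 short words, 2,500 characters including spaces.
page = "word " * 500
print(estimate_tokens(page))
```

An estimate like this is useful for quick budgeting, not for billing or for deciding whether a prompt will fit exactly.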

The context window is the maximum number of tokens the model can consider at once — the input you provide plus the output it generates, sharing the same budget. Everything the model "knows" in a single interaction lives inside this window: your instructions, the documents you pasted, the conversation so far, and the answer being written. There is no separate long-term memory. When people say a model "remembers" earlier in a chat, what they mean is that the earlier text is still inside the window. The moment something falls outside it, the model has no access to it at all.
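
One way to picture "falling outside the window" is a chat client that keeps only the most recent turns that fit. The sketch below is an illustration under stated assumptions: `fit_to_window` is a hypothetical helper, the window and reserve sizes are made up, and word count stands in for a real tokenizer.

```python
def count_words(text: str) -> int:
    """Stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_window(turns, window, reserve_for_output, count_tokens):
    """Keep the most recent turns that fit in the input share of the window.

    Walks backward from the newest turn and stops at the first turn
    that would exceed the budget; everything older is dropped.
    """
    budget = window - reserve_for_output
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["my name is Ada", "what is a token", "and what is attention"]
kept = fit_to_window(turns, window=12, reserve_for_output=4, count_tokens=count_words)
print(kept)
```

With these toy numbers the oldest turn, the one containing the user's name, no longer fits. From the model's point of view it was never said.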

This is the first durable insight: the context window is short-term working memory, not storage. It does not persist, it does not grow on its own, and nothing inside it is guaranteed to be used.

Attention: the engine and its cost

To understand why long context is hard, you need a feel for attention, the mechanism that lets the model relate each token to the others. For every token it processes, the model weighs how much each other token in the window should influence it. That is what lets a model connect a pronoun to the noun it refers to, or a question to the relevant sentence buried earlier in a document.

The crucial property is how that cost grows. Because every token can attend to every other token, the work scales roughly with the square of the number of tokens. Double the input and you do not double the work — you roughly quadruple it. This is why processing very long inputs is disproportionately expensive in both time and money, and why the context window has a hard ceiling rather than being arbitrarily large. The square-law cost is a fundamental tax on length. Various techniques reduce it, and research on more efficient long-context methods is active and ongoing, but the basic pressure never disappears: longer input is superlinearly more expensive to attend over.
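
The arithmetic is easy to check. The sketch below counts token-to-token scores for plain, dense self-attention, where every token is scored against every other. It is a simplification: it ignores everything else a model computes, and causal masking or efficient-attention variants change the constant or the growth rate. The name `attention_pairs` is illustrative.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token scores in dense self-attention."""
    return n_tokens * n_tokens


baseline = attention_pairs(1_000)
for n in (1_000, 2_000, 4_000):
    pairs = attention_pairs(n)
    print(f"{n:>5} tokens -> {pairs:>12,} pairs ({pairs / baseline:.0f}x)")
```

Each doubling of the input multiplies the pair count by four, so going from 1,000 to 4,000 tokens is sixteen times the attention work, not four.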

Tokens are the unit of cost and limits

Because everything is measured in tokens, tokens are also the unit of nearly every practical constraint you will hit:

You are billed by tokens on hosted models, input and output both. A long document in the prompt costs real money every time you send it.
Latency tracks tokens. More tokens in the window generally means a slower response, both because there is more input to process before the first output token appears and because every generated token has to attend over everything already in the window.
The window is shared. A huge input leaves less room for the answer. If you fill the window with context, you can starve the output.

A practical habit follows: treat tokens as a budget you spend deliberately. Padding the prompt with "just in case" material is not free insurance — it costs money, adds latency, and, as the next section explains, can actually make answers worse.
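
A back-of-the-envelope calculation shows how quickly padding adds up. Everything numeric here is a placeholder: the prices, token counts, and request volume are invented for illustration and do not reflect any provider's actual rates. `request_cost` is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request when input and output are billed per 1,000 tokens."""
    return (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )


# Placeholder prices, not real rates.
PRICE_IN = 0.001   # per 1,000 input tokens
PRICE_OUT = 0.002  # per 1,000 output tokens
REQUESTS_PER_DAY = 10_000

padded = request_cost(50_000, 500, PRICE_IN, PRICE_OUT)  # whole document every time
lean = request_cost(2_000, 500, PRICE_IN, PRICE_OUT)     # only the relevant passages

print(f"padded prompt: {padded * REQUESTS_PER_DAY:,.2f} per day")
print(f"lean prompt:   {lean * REQUESTS_PER_DAY:,.2f} per day")
```

The answer length is the same in both cases. The whole difference comes from input that is sent again on every request.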

Where long context breaks: the lost middle

Here is the failure mode that surprises people most. Even when text fits comfortably inside the window, the model does not attend to all of it equally. A widely observed pattern is that models use information at the beginning and end of a long input more reliably than information buried in the middle. Put the critical fact in the center of a long document and the model may behave as if it never saw it — even though, technically, it is right there in the window.

This means a large context window does not guarantee that the model will use everything you put in it. Fitting and using are different things. The capacity number on a datasheet tells you what fits; it tells you almost nothing about what gets reliably used.

The same effect explains a counterintuitive result teams keep rediscovering: stuffing more documents into the prompt can lower answer quality rather than raise it. More irrelevant text dilutes attention and pushes the relevant part deeper into the middle, where it is easiest to miss. With context, more is not automatically better, and is sometimes worse.

Designing for the limits, not around them

You cannot abolish these constraints, but you can design so they rarely bite. The governing principle is simple: put less in, and put the right things where they will be seen.

Retrieve, do not dump. Instead of pasting an entire knowledge base, fetch only the handful of passages relevant to the current question and include just those. This is the core idea behind retrieval-augmented systems, and it exists precisely because dumping everything is both expensive and unreliable.
Position matters. Place the most important instructions and the most relevant evidence near the start or the end of the prompt, not buried in the middle of a long block.
Summarize the past. In a long conversation, periodically compress earlier turns into a short summary rather than carrying every word forward. This keeps the salient facts in the window without spending the whole budget on transcript.
Leave room for the answer. Reserve enough of the window for the output. A prompt that fills the window to the brim can truncate or degrade the response.
Test recall over your real lengths. If your use case involves long inputs, build a small evaluation that hides a known fact in the middle of a realistic document and checks whether the model retrieves it. Measure the failure mode directly instead of assuming the capacity number protects you.
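
The recall test in the last item can be sketched in a few lines. This is a minimal harness under stated assumptions: `ask_model` is a hypothetical callable that wraps whatever model client you use, and `fake_model` is a deliberately crude stand-in that only looks at the start and end of the prompt so the harness can run without an API. It is not a claim about how any real model behaves.

```python
def build_haystack(filler_paragraphs, needle, position):
    """Insert the needle at a relative position (0.0 = start, 1.0 = end)."""
    index = round(position * len(filler_paragraphs))
    parts = filler_paragraphs[:index] + [needle] + filler_paragraphs[index:]
    return "\n\n".join(parts)


def recall_by_position(ask_model, filler_paragraphs, needle, question,
                       expected, positions=(0.0, 0.5, 1.0)):
    """Return {position: True/False} for whether the answer contains the fact."""
    results = {}
    for pos in positions:
        document = build_haystack(filler_paragraphs, needle, pos)
        answer = ask_model(f"{document}\n\nQuestion: {question}")
        results[pos] = expected.lower() in answer.lower()
    return results


def fake_model(prompt):
    """Stand-in model: only 'sees' the first and last 300 characters."""
    visible = prompt[:300] + prompt[-300:]
    return "The code is 7421." if "7421" in visible else "I don't know."


filler = [f"Paragraph {i}: " + "routine maintenance notes. " * 10 for i in range(20)]
results = recall_by_position(
    fake_model, filler,
    needle="The vault code is 7421.",
    question="What is the vault code?",
    expected="7421",
)
print(results)
```

Swap `fake_model` for a real client, use documents of the length you actually ship, and run enough positions and trials to see where recall starts to drop.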

A worked example

Imagine a support assistant that answers from a product manual. The naive design pastes the entire manual into every prompt, trusting the large window to handle it. It will be slow, expensive, and — because the relevant paragraph is usually somewhere in the middle — unreliable. The disciplined design indexes the manual, retrieves the two or three passages that match the user's question, places them clearly near the end of the prompt, and leaves ample room for the answer. The second system is cheaper, faster, and more accurate, despite using a far smaller fraction of the available context. That is the whole lesson in one example: using the window well beats filling it.
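
The disciplined design can be sketched as follows. Keyword overlap is used here as a toy stand-in for a real index or embedding search, the manual passages are invented, and `build_prompt` is a hypothetical helper. The point is the shape of the prompt, with a few selected passages placed just before the question, not the ranking method.

```python
import re


def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(passage, question):
    """Toy relevance: number of question words that appear in the passage."""
    return len(words(passage) & words(question))


def build_prompt(passages, question, top_k=2):
    """Select the top_k passages and place them, then the question, last."""
    ranked = sorted(passages, key=lambda p: score(p, question), reverse=True)
    evidence = "\n\n".join(ranked[:top_k])
    return (
        "Answer using only the manual excerpts below.\n\n"
        f"Manual excerpts:\n{evidence}\n\n"
        f"Question: {question}"
    )


# Invented manual passages for illustration.
MANUAL = [
    "To reset the router hold the reset button for ten seconds.",
    "The warranty covers parts and labor for two years.",
    "Firmware updates are installed from the admin page.",
    "The reset button is on the back panel next to the power port.",
]

prompt = build_prompt(MANUAL, "How do I reset the router?")
print(prompt)
```

Only the two reset-related passages reach the model. The warranty and firmware text never enters the window, so it costs nothing and cannot dilute attention.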

The takeaway

A context window is the model's short-term working memory, measured in tokens, shared between your input and its output. Attention is what makes it useful and, because its cost grows with the square of length, what makes it limited. The capacity number tells you what fits, not what the model will reliably use — and the well-documented weakness in the middle of long inputs means more text is not the same as better answers. Design accordingly: retrieve instead of dump, position the important things where they are seen, summarize the past, leave room to respond, and test recall on your real lengths. The teams that respect the window's limits get more out of it than the teams that just buy a bigger one.

Sourcing note: the size of context windows and the specific techniques for extending them change rapidly, so this explainer focuses on the durable mechanics. For current capacities and methods, consult official model documentation and primary research directly.


Primary sources

Hugging Face Documentation
arXiv