Why context length is hard to scale

A longer context window sounds like a simple knob to turn. Underneath, it fights a cost that grows faster than the text, and attention that spreads thin.

Research · 2026-06-08 18:48 KST · Lead Editor · 7 min read

The context window is how much text a language model can take in at once: the prompt, the documents, the conversation so far. Bigger sounds obviously better, since more room means you can feed a model a whole report, a long codebase, or an entire conversation history without leaving anything out. So why not just make every model's window enormous? Because under the hood, length is not a free parameter. It fights against the very mechanism that makes these models work, and the cost grows faster than the text does. This piece is about why scaling context is genuinely hard, not merely a matter of allocating more memory.

There are really two problems stacked on top of each other: a problem of cost, and a problem of quality. Both have to be solved, and they pull in different directions.

The mechanism that makes length expensive

Most modern language models are built on attention, the mechanism that lets each token look at other tokens to figure out what is relevant. Attention is what gives these models their power: every piece of the input can, in principle, relate to every other piece.

That phrase — every piece to every other piece — is also the source of the trouble. In the standard form of attention, the amount of work scales with the number of token pairs. Double the length of the input and you do not double the work; you roughly quadruple it, because the number of pairs grows with the square of the length. This is the famous quadratic cost of attention, and it is the central reason long context is expensive.

The intuition is simple: if you have N things and each must consider every other, you have on the order of N-times-N relationships to compute. Make N ten times bigger and the relationships grow a hundredfold. A window that is ten times longer can cost far more than ten times as much to process. Length, in attention, is not a linear expense.
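The arithmetic is easy to check with a few lines of code. This is a minimal sketch that only counts query-key pairs for full all-pairs attention; the function name and the 1,000-token baseline are illustrative choices, and the count ignores everything else a model does.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full all-pairs attention over n_tokens."""
    return n_tokens * n_tokens


BASE = 1_000  # illustrative baseline length, not a real model's window

for factor in (1, 2, 10):
    n = BASE * factor
    ratio = attention_pairs(n) // attention_pairs(BASE)
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} pairs ({ratio}x the baseline)")
```

Doubling the length gives four times the pairs, and ten times the length gives a hundred times the pairs, which is the quadratic growth described above.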

The memory problem on top of the compute problem

Cost is not only about computation; it is also about memory. To process and especially to generate text efficiently, a model holds onto intermediate representations of the tokens it has already seen — a running store often called the key-value cache. This cache lets the model avoid recomputing everything from scratch for each new token.

The catch is that this store grows in step with the length of the context: every additional token adds its own entries to the cache. The longer the window, the more of it the model must keep in fast memory at once. For very long contexts, this memory footprint can become the binding constraint, so that you run out of room to hold the context before you run out of patience for the compute. So long context strains two scarce resources simultaneously: the computation to relate everything, and the memory to hold everything. Engineering long-context systems is largely a battle on both fronts.
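A back-of-the-envelope estimate makes the memory pressure concrete. The sketch below assumes a cache that stores one key vector and one value vector per token, per layer, per key-value head. The model dimensions are hypothetical, chosen only to give round numbers, and the estimate covers a single sequence.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate key-value cache size for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.0f} GiB of cache")
```

With these made-up dimensions the cache comes to 1 GiB at 8,192 tokens and 16 GiB at 131,072 tokens. The growth is proportional to length rather than quadratic, but it is memory that must stay resident for as long as the context is in use.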

The quieter problem: attention spreads thin

Suppose you pay the cost and make the window huge. A second, subtler problem appears, and it is about quality rather than expense. Attention works by distributing focus across the input. When there are only a few things to attend to, focus is concentrated. When there are a great many, that focus has to spread across all of them, and the truly relevant pieces can get diluted among a sea of irrelevant ones.

A long context can therefore make it harder for a model to find and use the one detail that matters, even though the detail is technically present. The information is in the window, but it is competing for attention with everything else in the window. A bigger haystack does not help you find the needle; it can bury it.
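A toy calculation shows how dilution can happen. Assume, purely for illustration, that one relevant token receives a score higher than all the others by a fixed gap, and that every other token receives the same lower score. Softmax then gives the relevant token a share that shrinks as the number of distractors grows. Real models learn far less uniform scores, so this is a sketch of the tendency, not a measurement.

```python
import math


def weight_on_relevant(n_tokens: int, score_gap: float) -> float:
    """Softmax weight on one relevant token among n_tokens.

    Toy assumption: the relevant token's score exceeds every
    distractor's score by score_gap, and all distractors tie.
    """
    boosted = math.exp(score_gap)
    return boosted / (boosted + (n_tokens - 1))


for n in (10, 1_000, 100_000):
    print(f"{n:>7} tokens -> weight on the relevant one: {weight_on_relevant(n, 5.0):.4f}")
```

The relevant token's score never changes here; only the size of the crowd does. Keeping its share constant as the context grows would require the score gap itself to grow.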

This is why a model with a very long window does not automatically use that window well. There is a real difference between being able to fit a long input and being able to reason over it effectively. The first is about capacity; the second is about whether attention can still pick out what counts when there is so much to consider.

Lost in the middle

A particularly well-known version of this quality problem is the tendency of models to use information unevenly depending on where it sits in a long context. Material near the beginning and near the end of a long input often gets used more reliably than material buried in the middle. Put the crucial fact in the center of a long document and the model may effectively overlook it, even though it read it.

This pattern — sometimes described as a model being weaker in the middle of its context — is a reminder that a long window is not a uniform, perfect memory. It is a span of attention with its own geography of strengths and weak spots. Knowing this changes how you use long context: placing the most important material where the model attends best, rather than assuming every position is equal, is a practical lever that follows directly from the underlying behavior.
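One way to act on this when assembling a long prompt is to order passages so the most important ones land at the edges. The helper below is a hypothetical sketch: it assumes you already have passages ranked from most to least important, and it alternates them between the front and the back so the least important end up in the middle.

```python
def place_at_edges(passages_by_importance: list[str]) -> list[str]:
    """Reorder ranked passages so the top ones sit at the start and end.

    Input is ordered most important first. Even-ranked items fill the
    front, odd-ranked items fill the back in reverse, which pushes the
    least important material toward the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, passage in enumerate(passages_by_importance):
        if rank % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" matters most, "E" least
print(place_at_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this ordering helps for a given model is something to test rather than assume; the sketch only encodes the placement idea.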

Why you cannot just train on longer text

You might think the fix is simply to train the model on very long examples until it learns to handle them. It helps, but it does not come free. Training on long sequences inherits the same quadratic cost, so it is expensive in exactly the way using long context is expensive. And a model trained mostly on shorter text may not generalize gracefully when suddenly shown an input far longer than anything it saw in training — its sense of position and relevance can degrade past familiar lengths. Extending context is thus not a switch to flip; it is a property that has to be deliberately built and paid for, in both training and serving.

How the field pushes the limits

Because the obstacles are real, much of the work on long context is about changing the rules rather than brute-forcing them. The directions are intuitive even without the math.

Cheaper attention. A large research effort aims to approximate attention so its cost grows closer to linearly with length instead of quadratically, by avoiding the full all-pairs computation. Trading a little exactness for a lot of scalability is the recurring theme.
Shrinking the memory footprint. Techniques to compress or economize the key-value cache let longer contexts fit in the same memory, attacking the memory side of the problem.
Retrieving instead of holding everything. Rather than stuffing an entire corpus into the window, fetch only the relevant pieces and feed those in. This sidesteps the length problem by keeping the actual context small — the strategy behind retrieval-augmented approaches.
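To make the first direction concrete, here is one simple example of cheaper attention, chosen for illustration: a sliding window, where each token looks only at itself and a fixed number of preceding tokens. The sketch counts token pairs under causal attention (each token sees itself and earlier tokens) with and without the window. The function names and the window size of 100 are assumptions, not a description of any particular system.

```python
def causal_pairs(n_tokens: int) -> int:
    """Pairs when each token attends to itself and all earlier tokens."""
    return n_tokens * (n_tokens + 1) // 2


def windowed_pairs(n_tokens: int, window: int) -> int:
    """Pairs when each token attends to at most `window` tokens,
    counting itself and the tokens immediately before it."""
    return sum(min(position + 1, window) for position in range(n_tokens))


WINDOW = 100  # illustrative window size

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens: full {causal_pairs(n):>10,} pairs, "
          f"windowed {windowed_pairs(n, WINDOW):>8,} pairs")
```

With a fixed window, doubling the length roughly doubles the pairs instead of quadrupling them. The price is that a token can no longer look directly at anything outside its window.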

Each path makes a trade. Cheaper attention gives up some precision; compression gives up some fidelity; retrieval gives up the guarantee that everything is present at once. There is no free lunch, only different bargains.
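The retrieval bargain can be sketched in a few lines. The example below scores passages by how many distinct words they share with the query and keeps only the top few. Word overlap is a deliberately crude stand-in for whatever relevance scoring a real system would use, and the function name, passages, and query are all made up for illustration.

```python
def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages sharing the most distinct words with the query.

    Toy scoring: lowercase, split on whitespace, count shared words.
    """
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:k]


passages = [
    "the key-value cache grows with context length",
    "attention compares every token with every other token",
    "bananas are rich in potassium",
]
query = "why does the cache grow with context length"

# Only the selected passages would be placed in the model's window.
print(retrieve(query, passages, k=1))
```

The context stays small no matter how large the collection gets, but anything the scoring step fails to select never reaches the model at all.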

The takeaway

Context length is hard to scale because the dominant mechanism behind these models, attention, costs work that grows with the square of the input, while the key-value cache needed to run efficiently grows in step with length. Longer windows therefore strain compute faster than the text grows, and strain memory in proportion to it. And even when you pay that price, a longer context spreads attention thinner and uses its middle less reliably, so fitting more text is not the same as reasoning over it well. The frontier moves by changing the bargain (cheaper attention, leaner memory, or retrieving only what matters), each trading something for reach. A bigger window is never just a bigger number; it is a fight against a cost that compounds.

