Emergent abilities: real or mirage?

Big models seem to suddenly "get" skills smaller ones lack. Is that a real phase change, or a trick of how we measure? The honest answer is: both.

research · 2026-04-03 08:35 KST · Lead Editor · 7 min read

One of the most striking and most argued-over claims about large language models is that they show emergent abilities: skills that are absent in smaller models and appear, seemingly all at once, when models cross some threshold of scale. The image is dramatic — a capability that simply was not there suddenly switches on. It has fueled both excitement and unease about where scaling leads. It has also been challenged hard. The honest picture is more interesting than either the hype or the debunking suggests.

The question at the center of it all: when a large model can do something a smaller one cannot, is that a genuine phase change in the model, or an artifact of how we chose to measure it? Getting this right matters for how we think about what scaling will and will not deliver.

What "emergent" is supposed to mean

The claim is specific. An ability is called emergent if a model's performance on some task stays flat and near-useless across a wide range of smaller sizes, and then rises sharply once the model passes a certain scale. Plotted against size, the curve looks like a flat line followed by a sudden cliff upward. The ability appears to be qualitatively new, not a smooth continuation of what came before.

This is a stronger claim than "bigger models are better." Better-in-general is expected and follows the smooth curves of scaling laws. Emergence says something extra: that certain capabilities are not gradually acquired but instead snap into existence past a threshold, in a way you could not have predicted by watching smaller models. If true, it would mean scaling holds surprises — abilities we cannot see coming until they suddenly arrive.

Why people believed it

The belief did not come from nowhere. Across many tasks, researchers genuinely observed this pattern: small and medium models scored at chance, and then larger models scored well, with the jump appearing concentrated in a narrow band of scale. Multi-step reasoning, certain kinds of arithmetic, following intricate instructions — these often looked like they had a switch that flipped on only above some size.

For tasks like these, the smaller models really did seem incapable, not merely worse. A model that gets a multi-step problem entirely wrong, every time, looks categorically different from one that gets it right. The leap from "never" to "often" feels like a change in kind, not degree. That intuition — that something new had appeared — is what made emergence such a compelling and widely repeated idea.

The deflating counter-argument

Then came a sharp critique, and it landed on the measurement. Many of the tasks where emergence showed up were scored in an all-or-nothing way: the model got full credit only for a completely correct answer and zero for anything else. On a multi-step problem, getting nine of ten steps right still scores zero under that rule.

Under such a metric, a model can be improving steadily and invisibly — getting more and more of the steps right — while its score stays pinned at zero, because it has not yet crossed the line of getting everything right. Then, when it finally does cross that line, the score leaps. The underlying ability grew smoothly; only the harsh scoring rule made it look like a sudden jump. Measured with a gentler metric that gives partial credit, many supposedly emergent curves straighten out into the same smooth improvement scaling laws predict. The cliff, on this view, was in the ruler, not the model.
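The arithmetic behind this argument can be sketched in a few lines. This is a toy illustration, not a reproduction of any published experiment: the per-step accuracies are made-up values standing in for a sequence of progressively larger models, the task is assumed to have ten steps, and the steps are assumed to succeed independently, so the chance of a fully correct answer is the per-step accuracy raised to the number of steps. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-step accuracies for a series of progressively larger models.
# These numbers are invented for illustration; they are not measurements.
STEP_ACCURACIES = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]


def exact_match_rate(step_accuracy: float, num_steps: int) -> float:
    """All-or-nothing score: probability that every step is right.

    Assumes each step succeeds independently with the same probability.
    """
    return step_accuracy ** num_steps


def partial_credit_score(step_accuracy: float) -> float:
    """Partial-credit score: expected fraction of steps answered correctly."""
    return step_accuracy


def score_table(
    step_accuracies: Sequence[float], num_steps: int = 10
) -> List[Tuple[float, float, float]]:
    """Return (step accuracy, partial-credit score, exact-match score) rows."""
    return [
        (p, partial_credit_score(p), exact_match_rate(p, num_steps))
        for p in step_accuracies
    ]


if __name__ == "__main__":
    print("step_acc  partial  exact_match")
    for p, partial, exact in score_table(STEP_ACCURACIES):
        print(f"{p:8.2f}  {partial:7.2f}  {exact:11.4f}")
```

Under these assumptions the partial-credit column climbs steadily from 0.50 to 0.99, while the exact-match column sits below 0.03 for the first three models and then rises to roughly 0.90 for the last one. The same smooth improvement produces a gentle slope on one ruler and a cliff on the other.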

Why this is not the end of the story

It would be tidy to conclude that emergence is entirely an illusion of measurement. But that goes too far. The critique shows convincingly that some apparent emergence is a metric artifact, and that all-or-nothing scoring can manufacture cliffs out of smooth progress. It does not show that every surprising capability gain can be explained that way.

Even when the underlying curve is smooth, there is a real and important sense in which a capability becomes usable only past a certain point. An ability that is technically present but only completes a task one time in a thousand is, for practical purposes, absent; the same ability completing the task most of the time is, for practical purposes, new. From the standpoint of someone using the model, that transition matters even if the internal curve was gradual all along. A curve that is smooth underneath can still hide a meaningful threshold for use.
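One way to make the "one time in a thousand" point concrete is to ask how many attempts a user would need, on average, before the model succeeds. The sketch below is a minimal illustration under stated assumptions: the success rates are invented, attempts are treated as independent, and the 0.5 usability threshold is an arbitrary choice for the example, not a standard. The function names are hypothetical.

```python
from typing import Optional, Sequence

# Hypothetical task success rates for progressively larger models.
# Invented for illustration; they grow gradually, with no discontinuity.
SUCCESS_RATES = [0.001, 0.01, 0.1, 0.4, 0.8]


def expected_attempts(success_rate: float) -> float:
    """Average number of independent tries needed for one success."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def first_usable_index(
    success_rates: Sequence[float], threshold: float = 0.5
) -> Optional[int]:
    """Index of the first model whose success rate reaches the threshold.

    The threshold is an assumption: what counts as usable depends on the task.
    Returns None if no model reaches it.
    """
    for index, rate in enumerate(success_rates):
        if rate >= threshold:
            return index
    return None


if __name__ == "__main__":
    for rate in SUCCESS_RATES:
        print(f"success rate {rate:.3f}: about {expected_attempts(rate):.1f} attempts")
    print("first usable model index:", first_usable_index(SUCCESS_RATES))
```

With these made-up numbers, the first model needs about a thousand attempts per success and the last needs fewer than two. Only the last model clears the assumed threshold, even though the success rate rose at every step along the way.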

Untangling three different claims

The confusion clears up once you separate three things people mean by emergence. The first is smooth capability growth, which is just scaling working as expected and is not surprising. The second is sharp curves caused by harsh metrics, which are largely artifacts and can be smoothed away by better measurement. The third is genuine thresholds of usefulness, where a gradually improving ability crosses from impractical to practical and changes what the model is good for in practice.

Most of the heated debate comes from arguing about these as if they were one claim. The deflating critique mainly targets the second. The excited reporting mostly noticed the third. And the first underlies all of it. Disagreements about whether emergence is "real" usually turn out to be disagreements about which of these three someone has in mind.

What this means for predicting the future

The practical stakes are about forecasting. If capabilities truly appeared from nowhere past unpredictable thresholds, then scaling would be nearly impossible to reason about: you could never know what the next model would suddenly be able to do. The metric critique is partly reassuring here, because much apparent unpredictability dissolves into smooth, forecastable trends once you measure carefully.

But the reassurance is incomplete. Even smooth underlying progress can produce abrupt changes in what a model is useful for, and those practical thresholds are harder to predict than the smooth curves beneath them. So the responsible stance is neither "anything could emerge at any time" nor "nothing ever really emerges." It is that capability tends to grow smoothly, while usefulness can shift suddenly, and careful measurement is what lets you tell which is which.

The takeaway

Emergent abilities are both real and a mirage, depending on what you mean. Much of the dramatic, switch-flips-on appearance is an artifact of all-or-nothing scoring; measured with partial credit, many of the curves are smooth, and scaling behaves predictably. But a gradually improving ability can still cross a real threshold from useless to useful, and that practical jump matters even when nothing discontinuous happened inside the model. Separate smooth growth, metric artifacts, and thresholds of usefulness, and the argument stops being a yes-or-no fight and becomes what it should have been all along: a question of careful measurement.

#emergence #scaling #evaluation #research
