Scaling laws: bigger, but why

"Make it bigger" sounds like a slogan, not a science. Scaling laws are what turned it into one. Here is what they actually say, and what they do not.

research · 2026-04-17 16:38 KST · Lead Editor · 7 min read

"Just make it bigger" is the caricature of how modern AI progress happens, and like most caricatures it contains a real face. The serious version of that idea is called a scaling law, and it is one of the most consequential findings in the field. Scaling laws are what turned "bigger is better" from a hunch into something predictable enough to plan billion-dollar projects around. Understanding them clears up a great deal of confusion about why models keep improving and what those improvements cost.

The core finding, stated plainly: as you increase the size of a model, the amount of data it trains on, and the computation spent training it, the model's performance improves in a smooth, predictable way. Not in lucky jumps — smoothly, and reliably enough to forecast.

What a scaling law actually claims

A scaling law is an observed relationship between the resources you put into training a model and how well that model performs at predicting text. Researchers measured this by training many models of different sizes, on different amounts of data, with different amounts of compute, and plotting how performance changed.

What they found was not noise. The points fell along a remarkably clean curve. Performance improved steadily as resources grew, and it did so in a regular enough pattern that you could extrapolate: given how a small model performed, you could predict with surprising accuracy how a much larger one would. That predictability is the whole reason scaling laws matter. They turned model-building from guesswork into something closer to engineering, where you can estimate what a given investment will buy before you spend it.
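One way to picture that clean curve, assuming the power-law form commonly reported in the scaling-law literature. The symbols L (loss), C (training compute), a, and α are notation introduced here for illustration and do not come from this article:

```latex
L(C) \approx a \, C^{-\alpha}
\quad\Longleftrightarrow\quad
\log L(C) \approx \log a - \alpha \log C,
\qquad a > 0,\ \alpha > 0
```

Under that assumption the curve is a straight line with slope −α on log-log axes, which is why a handful of small runs can hint at where a much larger one will land.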

Three ingredients drive the curve: the number of parameters in the model, the amount of training data, and the total compute used. Push any of them up, in balance with the others, and performance improves along the expected path.

Why bigger keeps helping

It is reasonable to expect that piling on more size would hit a wall quickly. A model is, after all, just predicting the next piece of text. Why would making it ten times larger keep paying off rather than saturating?

The intuition is that language and the world behind it are extraordinarily rich. There is a near-bottomless supply of patterns to learn: rarer words, subtler grammatical structures, less common facts, more intricate reasoning chains, more specialized domains. A small model can only capture the most common, most obvious regularities. A larger one trained on more data has the capacity to absorb the long tail — the patterns that show up rarely but collectively make up a huge fraction of real language.

So scaling works not because bigger models are magically smarter, but because there is so much structure to learn that earlier models were not large enough to capture all of it. Adding capacity and data lets them reach further into that structure. Prediction error keeps falling because the supply of learnable patterns has not run out.

The balance between size and data

One of the most useful refinements to scaling laws is that the three ingredients have to grow together. It is not enough to make a model enormous if you starve it of data, nor to flood a tiny model with more text than it can absorb. For a given amount of compute, there is a balanced split: a model of a certain size trained on a certain amount of data.
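A toy sketch of the "starved model" effect, assuming an additive loss with one penalty term for too few parameters and another for too little data. The functional form is a common modeling choice and every constant below is a made-up placeholder, not a fitted value:

```python
FLOOR = 2.0               # hypothetical loss that no amount of scaling removes
A, ALPHA = 400.0, 0.34    # hypothetical penalty for having too few parameters
B, BETA = 400.0, 0.28     # hypothetical penalty for having too little data


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Toy loss: a floor plus one shrinking penalty per ingredient."""
    return FLOOR + A / n_params**ALPHA + B / n_tokens**BETA


tokens = 1e9  # a fixed, deliberately small dataset
data_limited_floor = FLOOR + B / tokens**BETA

# Growing the model alone keeps helping, but less each time, and it can
# never push the loss below the floor set by the starved data term.
for n_params in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n_params:.0e} params -> loss {predicted_loss(n_params, tokens):.3f}")
print(f"limit with this much data: {data_limited_floor:.3f}")
```

In this sketch the only way past the plateau is to raise the token count as well, which is the sense in which the ingredients have to grow together.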

Early in the field, models were often made very large relative to the data they saw. Later work showed that for the same compute budget, a somewhat smaller model trained on substantially more data could do better. The lesson was not "size matters less" but "size and data must be matched." Spending your compute in the right proportions matters as much as how much compute you have.
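As a back-of-the-envelope sketch of what "the right proportions" can look like, the snippet below leans on two widely quoted rules of thumb that are assumptions here, not claims from this article: training cost is roughly 6 FLOPs per parameter per token, and a balanced run uses on the order of 20 tokens per parameter. Both numbers vary with architecture and data, so treat the output as an order-of-magnitude guide.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: compute C is roughly 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rough balanced ratio, not a constant of nature


def balanced_split(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend the whole budget at the assumed ratio."""
    # C = 6 * N * D and D = 20 * N together give C = 120 * N**2.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


params, tokens = balanced_split(1e23)  # hypothetical budget of 1e23 FLOPs
print(f"params: {params:.2e}, tokens: {tokens:.2e}")
```

Because the budget scales with the square of the parameter count under these assumptions, a 100x larger budget supports a model only about 10x larger, with the rest going to data.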

This balance is why you cannot read a model's quality from its parameter count alone. A smaller model trained on more data, in better balance, can outperform a larger one trained on too little. The headline number is only part of the story.

Why predictability changed everything

The practical power of scaling laws is forecasting. Training a frontier model is enormously expensive, and you only get to do it a few times. Without scaling laws, each attempt would be a gamble: build the biggest thing you can afford and hope it works.

Scaling laws remove much of that gamble. Because performance follows a predictable curve, teams can train a series of small, cheap models, fit the curve, and extrapolate to estimate how a far larger model will perform before committing to building it. They can also use the laws to decide how to spend a fixed budget — how large to make the model, how much data to gather — to get the best result. This is why scaling laws are sometimes described as the planning tool of modern AI. They convert a high-stakes bet into a calculated investment.
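The fit-then-extrapolate workflow can be sketched in a few lines, assuming loss follows a pure power law in compute. The "measurements" below are synthetic, generated from a known curve so the fit can be checked. Real runs are noisy, and a pure power law ignores any irreducible floor.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - alpha * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


def predict(a: float, alpha: float, compute: float) -> float:
    """Evaluate the fitted curve a * compute**(-alpha)."""
    return a * compute ** -alpha


# Hypothetical small, cheap runs (synthetic data, not real results).
small_compute = [1e17, 1e18, 1e19, 1e20]
small_loss = [12.0 * c ** -0.05 for c in small_compute]

a, alpha = fit_power_law(small_compute, small_loss)
forecast = predict(a, alpha, 1e24)  # 10,000x beyond the largest run
print(f"a={a:.3f}, alpha={alpha:.3f}, forecast loss at 1e24 FLOPs={forecast:.3f}")
```

Fitting in log space turns the problem into ordinary linear regression, which is one reason the straight-line behavior is so convenient for planning.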

The catch: what scaling laws measure

Here is the crucial fine print. Scaling laws predict how well a model does at its training objective — broadly, how well it predicts text. They do not directly predict the things people actually care about, like whether the model can reason through a hard problem, follow instructions, or avoid making things up.

The connection between the two is real but loose. Better text prediction tends to come with better downstream abilities, but the relationship is not tidy, and improvements on the training objective do not map cleanly onto improvements on any specific task. A model can get measurably better at its objective while a particular capability you care about barely moves, or jumps unexpectedly. So scaling laws are a reliable guide to one quantity and only an indirect guide to the abilities that quantity is supposed to underwrite.
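One hedged illustration of why the mapping is loose: suppose a task only counts as solved when every token of a 20-token answer is right, and suppose each token is independently correct with the same probability. Both are simplifying assumptions made here for the sketch. Smooth gains per token then turn into a sharp change in task success.

```python
def task_success(per_token_accuracy: float, answer_length: int) -> float:
    """Chance an all-or-nothing answer is right if tokens are independent."""
    return per_token_accuracy ** answer_length


# Per-token accuracy climbs steadily; the all-or-nothing score does not.
for p in (0.80, 0.90, 0.95, 0.99):
    print(f"per-token {p:.2f} -> 20-token task {task_success(p, 20):.3f}")
```

How abrupt a capability looks can depend as much on how the task is scored as on how the underlying objective moved.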

What scaling does not promise

It is tempting to read scaling laws as a guarantee that more resources will solve everything. They promise less than that. They describe a trend observed over the ranges studied, and a smooth curve seen so far is not a contract that it continues forever. Every such trend eventually meets some limit — of available data, of useful compute, of the patterns left to learn.

Scaling also does not, on its own, deliver judgment, reliability, or honesty. Those come from how a model is shaped after the raw capability is built, not from size alone. And scaling's costs grow faster than its benefits: each further step along the curve demands a multiplicative increase in compute, energy, and data. Scaling laws explain why bigger has kept helping, and they help plan how to spend resources, but they are a description of a pattern, not a law of nature promising the pattern never ends.
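To put a rough number on how cost relates to benefit, the sketch below assumes reducible loss follows a pure power law in compute with a hypothetical exponent of 0.05. The exponent is chosen for illustration only; real values depend on the model family and the data.

```python
ALPHA = 0.05  # hypothetical exponent, for illustration only


def compute_multiplier(loss_ratio: float, alpha: float = ALPHA) -> float:
    """Compute growth needed to scale loss by loss_ratio when loss ~ C**(-alpha)."""
    return loss_ratio ** (-1.0 / alpha)


print(f"10% lower loss: {compute_multiplier(0.9):.1f}x compute")
print(f"50% lower loss: {compute_multiplier(0.5):.0f}x compute")
```

With a small exponent, modest improvements are affordable but large ones require budgets that grow by orders of magnitude, which is the diminishing-returns shape behind the cost concern.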

The takeaway

Scaling laws are the finding that model performance improves smoothly and predictably as you grow model size, data, and compute together. That predictability is their real significance: it turned "make it bigger" from a slogan into a planning tool, letting teams forecast and budget instead of gamble. But the laws measure how well a model predicts text, not the specific abilities we ultimately want, and they describe a trend rather than guarantee it forever. Bigger has kept helping because there is so much structure left to learn — and reading the laws for exactly what they claim, and no more, is how you avoid overreading them.

#scaling-laws #compute #training #research
