Quantization and distillation: making models smaller

"Two different ways to shrink a model — one changes its numbers, the other trains a smaller copy. Here is how each works and when to reach for it."

models · 2026-04-12 16:37 KST · Lead Editor · 7 min read

Large models are expensive to run. They need a lot of memory and a lot of computation, and they respond more slowly than smaller ones. So a great deal of practical AI work is about making models smaller without making them noticeably worse. Two techniques dominate that effort, and they are often mentioned in the same breath even though they work on completely different principles. Quantization shrinks a model by storing its numbers more compactly. Distillation shrinks a model by training a smaller one to imitate a larger one. Understanding the difference is the key to knowing which lever to pull.

This piece explains both in plain language, walks through the trade-offs each one makes, and offers a practical way to think about when smaller is the right goal in the first place.

Why size is a problem worth solving

A model is, at bottom, an enormous collection of numbers called parameters — the values it learned during training. Bigger models have more parameters, and more parameters mean more memory to hold them and more arithmetic to run them. That translates directly into cost, into latency, and into where a model can physically live. A model small enough to run on a phone or a modest server opens up uses that a giant model in a data center cannot serve cheaply or quickly.
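
To make the link between parameter count and memory concrete, here is a rough back-of-the-envelope sketch. The 7-billion-parameter figure and the function name are illustrative assumptions, and the estimate covers only the stored parameters, not the extra memory a model needs while it is running.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the parameters, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Hypothetical model size, chosen only for illustration.
params = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit parameters: {weight_memory_gib(params, bits):.1f} GiB")
```

Each halving of the bits per parameter halves the footprint, which is why the storage format matters as much as the parameter count.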

So the goal of compression is not smallness for its own sake. It is to fit a model into a tighter budget of memory, money, or time while keeping as much of its capability as possible. The interesting question is always the same: how much can you shrink before quality drops in a way that matters for your task?

Quantization: storing the same model more compactly

Every parameter in a model is a number, and numbers can be stored at different levels of precision. A high-precision number reserves many bits to capture fine gradations; a low-precision number uses fewer bits and captures the value more coarsely. Quantization is the act of converting a model's parameters from high precision to lower precision, for example from 16-bit or 32-bit floating-point values to 8-bit or 4-bit integers, so that each number packs into less space.

The intuition is like saving a photograph at a lower quality setting. The image is still recognizable, the file is much smaller, and for most purposes you cannot tell the difference. A quantized model keeps the same structure and the same learned knowledge; it just represents each value more coarsely. Because the numbers take less room, the model uses less memory and often runs faster, since moving and multiplying smaller numbers is cheaper.
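
As a minimal sketch of what "representing each value more coarsely" can look like, the snippet below applies one simple scheme: symmetric 8-bit quantization with a single shared scale. The scheme, the function names, and the sample values are assumptions for illustration; real toolkits use more refined variants.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up parameter values, for illustration only.
weights = [0.8231, -0.4117, 0.0529, -1.2700, 0.3346]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

for original, stored, approx in zip(weights, q, restored):
    print(f"{original:+.4f} -> {stored:+4d} -> {approx:+.4f}")
```

The stored integers need one byte each instead of two or four, and the restored values land close to the originals but not exactly on them. That small gap is the detail being traded away.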

The catch is that coarser numbers lose detail. Push the precision low enough and the rounding errors accumulate until the model's behavior degrades — answers get less accurate, especially on harder tasks. There is a sweet spot, and finding it is the whole game. Modest quantization is often nearly free in quality; aggressive quantization starts to bite.
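
The sketch below illustrates how the rounding error grows as the bit width drops. It uses synthetic, normally distributed numbers as a stand-in for real parameters and the same simple symmetric scheme assumed purely for illustration. It measures error in the stored values only; how much that hurts a real model's answers still has to be measured on the task itself.

```python
import random


def mean_roundtrip_error(values, bits):
    """Mean absolute error after quantizing to `bits` bits and back."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    restored = [
        max(-levels, min(levels, round(v / scale))) * scale for v in values
    ]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Synthetic stand-in for real parameters: 10,000 draws from a normal distribution.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mean_roundtrip_error(weights, bits) for bits in (8, 6, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error {err:.4f}")
```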

Distillation: training a smaller model to imitate a bigger one

Distillation takes a different route. Instead of compressing an existing model's numbers, it builds a new, smaller model and trains it to mimic a large one. The large model is the "teacher," the small model is the "student," and the student learns by watching the teacher's outputs and trying to reproduce them.

The reason this works better than simply training a small model on raw data is subtle. When a teacher model responds to an input, its output carries more information than a plain right-or-wrong label — it reflects how the teacher weighs different possibilities, which captures something about the structure of the problem. The student learns from that richer signal. The result can be a much smaller model that performs surprisingly close to its teacher on the kinds of inputs it was trained to imitate, because it inherited the teacher's learned judgment rather than rediscovering it alone.
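
One common way to turn that "richer signal" into a training objective is to compare the teacher's and student's output distributions directly. The sketch below assumes a classification-style setup with raw scores (logits), a softening temperature, and a KL-divergence loss. None of these specifics come from this article, and practical recipes usually add further terms, such as a loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q): how far distribution q is from the target p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Penalty for the student disagreeing with the teacher's soft outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)


# Made-up scores over three possible answers, for illustration only.
teacher_logits = [4.0, 2.5, 0.1]
good_student = [3.8, 2.6, 0.2]
poor_student = [0.5, 3.0, 2.0]

print("teacher soft targets:", softmax(teacher_logits, temperature=2.0))
print("loss, close student: ", distillation_loss(teacher_logits, good_student))
print("loss, poor student:  ", distillation_loss(teacher_logits, poor_student))
```

A plain label would say only "the first answer is correct." The teacher's soft targets also say the second answer is plausible and the third is not, and the loss rewards a student that reproduces that whole ranking.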

The catch here is different from quantization's. A distilled model is genuinely a different, smaller model, so its ceiling is lower. It tends to do well on the territory its teacher demonstrated and can fall off on inputs far from that territory. Distillation also costs real effort up front — you have to run the training process — whereas quantization is comparatively quick to apply to a finished model.

The core difference, stated plainly

Quantization keeps the same model and changes how its numbers are stored. Distillation trains a new, smaller model to act like the large one, which is then no longer needed at serving time. One is a change of representation; the other is a change of model. That distinction explains most of the trade-offs between them. Quantization is fast to apply and preserves the original's full structure, but it is limited in how far it can shrink before quality suffers. Distillation can produce dramatically smaller and faster models, but it requires training effort and yields a model with a genuinely lower ceiling.

Choosing between them — and combining them

The practical question is what your constraint actually is. If you have a model you like and simply need it to fit in less memory or run a bit faster, quantization is the lower-effort first move; apply a moderate level, measure quality on your own task, and stop before it degrades. If you need something far smaller and faster than quantization alone can deliver — small enough for a constrained device, or cheap enough to run at very high volume — distillation is the heavier tool that gets you there.

Crucially, the two are not rivals. They operate on different principles, so they stack. A common production path is to distill a large model down to a smaller student and then quantize that student, capturing both kinds of savings. The smaller model from distillation reduces the parameter count; the quantization reduces the storage cost of each remaining parameter. Together they can compress far more than either alone, as long as you keep measuring quality at each step.
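
The arithmetic behind that stacking can be sketched in a few lines. The parameter counts and bit widths here are hypothetical round numbers chosen for illustration, and the calculation says nothing about quality, which still has to be measured separately.

```python
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage for the parameters alone, in megabytes."""
    return num_params * bits_per_param / 8 / 1_000_000


# Hypothetical sizes: a teacher and a student with 10x fewer parameters.
teacher_params = 1_000_000_000
student_params = 100_000_000

baseline = model_size_mb(teacher_params, 16)        # original model
quantized_only = model_size_mb(teacher_params, 4)   # same model, fewer bits
distilled_only = model_size_mb(student_params, 16)  # fewer parameters
both = model_size_mb(student_params, 4)             # fewer parameters, fewer bits

for name, size in [
    ("baseline", baseline),
    ("quantized only", quantized_only),
    ("distilled only", distilled_only),
    ("distilled + quantized", both),
]:
    print(f"{name:<22} {size:>7.0f} MB  ({baseline / size:.0f}x smaller)")
```

Because one factor cuts the number of parameters and the other cuts the bits per parameter, the savings multiply rather than add.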

How to know if you went too far

Neither technique tells you when it has hurt your application — only your own evaluation can. The discipline is the same in both cases: assemble a small set of inputs that represent your real task, with a clear notion of what a good output looks like, and compare the compressed model against the original on exactly those inputs. Watch especially the harder examples and the edge cases, because compression tends to degrade the difficult tasks first while leaving easy ones looking fine. A model that still aces simple inputs but quietly fails the hard ones is the classic sign you shrank past the sweet spot.
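
A minimal sketch of that comparison might look like the following. The exact-match scoring, the "easy"/"hard" labels, the 5-point threshold, and the toy models are all assumptions for illustration; a real evaluation would plug in its own models and its own definition of a good output.

```python
from collections import defaultdict


def accuracy_by_difficulty(model, examples):
    """Return {difficulty: fraction correct} for a callable model."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["difficulty"]] += 1
        if model(ex["input"]) == ex["expected"]:
            correct[ex["difficulty"]] += 1
    return {level: correct[level] / total[level] for level in total}


def compare(original, compressed, examples, max_drop=0.05):
    """List difficulty buckets where the compressed model lost too much."""
    base = accuracy_by_difficulty(original, examples)
    small = accuracy_by_difficulty(compressed, examples)
    return {
        level: (base[level], small[level])
        for level in base
        if base[level] - small[level] > max_drop
    }


# Toy stand-ins: the "task" is doubling a number, and the compressed model
# is rigged to fail on large inputs so the comparison has something to find.
def original_model(x):
    return x * 2


def compressed_model(x):
    return x * 2 if x < 50 else x * 2 + 1


examples = (
    [{"input": n, "expected": n * 2, "difficulty": "easy"} for n in range(1, 5)]
    + [{"input": n, "expected": n * 2, "difficulty": "hard"} for n in range(100, 104)]
)

regressions = compare(original_model, compressed_model, examples)
print(regressions)
```

Splitting the results by difficulty is the point of the design: an overall average would hide a compressed model that is perfect on easy inputs and broken on hard ones.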

The takeaway

Quantization and distillation both make models smaller, but they are not the same move. Quantization stores the same model's numbers more compactly — fast to apply, structure preserved, limited in how far it goes. Distillation trains a new, smaller model to imitate a larger one — more effort, far greater shrinkage, a lower ceiling. Pick by your constraint, combine them when you need to, and let an evaluation built from your own task tell you when small has become too small. Smaller is only better when it is still good enough at the thing you actually need it to do.

#quantization #distillation #model-compression #efficiency
