Distillation: teaching small models from big ones

Knowledge distillation trains a small model to imitate a large one. The trick is not copying answers, but copying the way the big model is unsure.

research · 2026-05-21 13:52 KST · Lead Editor · 7 min read

The biggest, most capable models are also the most expensive to run. They are slow, they are costly per query, and they often will not fit on the hardware you actually have. Knowledge distillation is the technique that lets you keep most of that capability while shedding most of the cost. The idea, in one line: train a small model to imitate a large one. The interesting part is what "imitate" turns out to mean.

The technique was popularized in 2015 by Hinton, Vinyals, and Dean's paper "Distilling the Knowledge in a Neural Network" as a way to compress an unwieldy, accurate model into a compact, deployable one. The setup has a memorable name. The large model is the teacher; the small model is the student. The student is trained not to rediscover the task from scratch, but to reproduce the teacher's behavior.

Why not just train the small model directly

The obvious alternative is to train a small model on the same labeled data the teacher saw and skip the teacher entirely. Sometimes that works. Often it does not, and the reason is informative.

Real training labels are usually hard labels: this image is a cat, full stop. That single answer throws away a lot of what a well-trained teacher knows. A good teacher does not just say "cat" — it says "almost certainly cat, slightly possible dog, definitely not airplane." That distribution of confidence across all the options is a far richer training signal than a one-word label. It encodes which mistakes are reasonable and which are absurd. A small model trained on hard labels never sees that; a student trained on the teacher's full output does.

Soft targets: the heart of the idea

The teacher's full probability distribution over possible answers is often called its soft targets (as opposed to the hard target of a single correct label). These soft targets carry what researchers sometimes call dark knowledge — the relationships the teacher has learned that are invisible in the labels themselves.

Consider digit recognition. A handwritten 7 might draw a little probability toward 1, because sevens and ones can look alike, and almost none toward 8. That tiny lean toward 1 is real information about the shape of the input and the structure of the problem. Training the student to match the whole distribution — not just the top answer — transfers that structure. The student learns the teacher's worldview, not just its conclusions.
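
A minimal sketch of why matching the whole distribution matters, using made-up numbers for the handwritten 7. The probabilities, the two students, and the helper names `kl_divergence` and `hard_label_loss` are illustrative assumptions, not taken from any real model. Both students give the same top answer with the same confidence, so a hard-label loss cannot tell them apart, while a loss against the teacher's soft targets can.

```python
import math

def kl_divergence(teacher, student):
    """KL(teacher || student) for dicts mapping label -> probability."""
    return sum(p * math.log(p / student[label])
               for label, p in teacher.items() if p > 0.0)

def hard_label_loss(student, label):
    """Cross-entropy against a single correct label."""
    return -math.log(student[label])

# Hypothetical teacher output for a handwritten 7 (illustrative numbers only).
teacher = {"7": 0.90, "1": 0.09, "8": 0.01}

# Two hypothetical students that both answer "7" with the same confidence.
student_a = {"7": 0.90, "1": 0.08, "8": 0.02}  # leans toward 1, like the teacher
student_b = {"7": 0.90, "1": 0.01, "8": 0.09}  # leans toward 8 instead

# The hard label sees no difference between the two students.
same_hard_loss = hard_label_loss(student_a, "7") == hard_label_loss(student_b, "7")

# The soft targets penalize the student whose runner-up is implausible.
kl_a = kl_divergence(teacher, student_a)
kl_b = kl_divergence(teacher, student_b)

print(same_hard_loss, kl_a < kl_b)
```

In this toy case the student that leans toward 1 sits much closer to the teacher than the one that leans toward 8, even though both would score identically on the label alone.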

To make these soft targets even more informative, distillation often softens the distribution further by dividing the teacher's logits by a temperature greater than 1 before the softmax. This spreads out the probabilities so that the small differences between the runner-up options become more pronounced and easier to learn from. The student's outputs are softened with the same temperature and trained to match this softened picture closely.
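
One common way to do this softening is to divide the logits by a temperature before applying the softmax. The sketch below assumes three classes and invented logits; `softmax_with_temperature` is a hypothetical helper name, and the temperature of 4 is an arbitrary choice for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a temperature above 1 flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes "7", "1", "8".
logits = [8.0, 4.0, 1.0]

sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

With these numbers the ranking of the classes is unchanged, but the runner-up's share grows from roughly 2% to roughly 24%, which gives the student a much stronger signal about the non-top options.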

What gets transferred, and what does not

Distillation transfers behavior, not understanding. The student learns to produce outputs that look like the teacher's outputs on the kinds of inputs it was trained on. That is powerful and also bounded:

It is only as good as the coverage. The student imitates the teacher on the examples it sees. On inputs unlike anything in the distillation data, the student has no teacher to copy and falls back on whatever it managed to generalize.
It can inherit the teacher's flaws. If the teacher is biased, overconfident, or wrong in a systematic way, the student copies that too. Distillation is faithful imitation, including faithful imitation of mistakes.
It rarely exceeds the teacher on the distilled task. The student is chasing the teacher's behavior; the teacher is the ceiling for that specific signal, even if the student is more efficient.

None of this makes distillation less useful. It just sets expectations: you are buying efficiency, not new capability.

Distillation for language models

The same idea applies to large language models, with some twists. A language model predicts the next token as a probability distribution over the vocabulary, so its soft targets are exactly the kind of rich signal distillation thrives on. A student model can be trained to match the teacher's next-token distributions across a large body of text.
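
As a rough sketch of what "matching next-token distributions" can look like as a loss, the function below averages the KL divergence between teacher and student over token positions. Everything here is an assumption for illustration: the function names, the toy four-token vocabulary, the logit values, and the temperature of 2. Multiplying by the squared temperature is a common convention for keeping gradient scales comparable across temperatures, not something the text above prescribes.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-probabilities at the given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - peak - log_total for z in scaled]

def token_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over positions, scaled by temperature squared.

    Each argument holds one logit vector per token position, and both
    models are assumed to share the same vocabulary.
    """
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_logits, temperature)  # teacher
        log_q = log_softmax(s_logits, temperature)  # student
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return (temperature ** 2) * total / len(teacher_logits)

# Toy example: two token positions, a vocabulary of four tokens.
teacher_logits = [[4.0, 1.0, 0.0, -2.0], [0.5, 3.0, 0.0, 1.0]]
close_student = [[3.5, 1.2, 0.1, -1.5], [0.4, 2.6, 0.2, 1.1]]
far_student = [[0.0, 0.0, 4.0, 0.0], [3.0, 0.0, 0.0, 0.0]]

loss_close = token_distillation_loss(teacher_logits, close_student)
loss_far = token_distillation_loss(teacher_logits, far_student)
print(loss_close < loss_far)
```

In a real training loop this quantity would be computed by a deep-learning framework on batches of logits and minimized with respect to the student's parameters. The pure-Python version only shows the arithmetic.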

There is a second, increasingly common flavor that does not require access to the teacher's internal probabilities at all. Here the teacher simply generates outputs — answers, explanations, worked solutions — and the student trains on that generated text as if it were ordinary training data. This is sometimes called sequence-level or generation-based distillation, and it blurs into the broader practice of training on model-produced data. It is convenient because it works with any teacher you can query, even one you can only reach through an interface that returns text.
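
The data-collection half of this flavor can be sketched in a few lines. The sketch assumes a hypothetical `teacher_generate` callable that maps a prompt string to generated text; here it is replaced by a stand-in with canned answers so the example runs on its own. The prompt/completion record layout and the simple filtering rules are also assumptions, not a standard.

```python
from typing import Callable, Dict, List

def build_distillation_set(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Query the teacher once per prompt; keep non-empty, non-duplicate pairs."""
    examples: List[Dict[str, str]] = []
    seen = set()
    for prompt in prompts:
        completion = teacher_generate(prompt).strip()
        if not completion or (prompt, completion) in seen:
            continue
        seen.add((prompt, completion))
        examples.append({"prompt": prompt, "completion": completion})
    return examples

# Stand-in teacher with canned answers; a real one would query a model.
CANNED = {
    "What is 2 + 2?": "2 + 2 = 4.",
    "Name a prime number above 10.": "11 is prime.",
}

def fake_teacher(prompt: str) -> str:
    return CANNED.get(prompt, "")

prompts = [
    "What is 2 + 2?",
    "Name a prime number above 10.",
    "What is 2 + 2?",       # duplicate, dropped
    "Unanswered prompt",    # empty teacher output, dropped
]
dataset = build_distillation_set(prompts, fake_teacher)
print(len(dataset))
```

The resulting records are then used like any other supervised fine-tuning data for the student. Only text crosses the boundary between the two models, which is why this approach works with a teacher that exposes no probabilities.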

Both flavors share the core bet: a smaller model can carry a surprising fraction of a larger model's competence if you train it on the larger model's behavior rather than on raw labels alone.

Why this matters in practice

Distillation is one of the main reasons capable AI can run cheaply and close to where it is needed. A distilled model can be small enough to serve at high volume, fast enough for interactive use, and compact enough to run on modest hardware. For many real deployments, the question is not "what is the most capable model in existence?" but "what is the most capable model I can afford to run a million times a day?" Distillation moves that frontier.

It also enables a useful division of labor: invest heavily in one large, expensive teacher, then distill it into a family of smaller students tuned for different cost and latency budgets. You pay for the hard work once and amortize it across many cheaper models.

The honest trade-offs

Distillation is not free, and it is not lossless.

You give up some quality. The student is smaller; on the hardest inputs the gap between teacher and student shows. The art is choosing a student size where the loss is acceptable for your use case.
It needs the right data. The student only learns where the teacher demonstrates. Choosing what to distill on — covering the inputs you actually care about — matters as much as the algorithm.
It can amplify quiet failures. Because the student copies the teacher uncritically, a subtle teacher bias can become baked into a model you then ship widely.
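
One simple way to see where the quality gap and the coverage gap show up is to compare teacher and student answers separately for each kind of input. The sketch below assumes you already have paired answers tagged with a slice name; the slice names, the records, and the helper `agreement_by_slice` are all invented for illustration.

```python
from collections import defaultdict

def agreement_by_slice(records):
    """Fraction of inputs per slice where the student matches the teacher.

    records: iterable of (slice_name, teacher_answer, student_answer).
    """
    counts = defaultdict(lambda: [0, 0])  # slice -> [matches, total]
    for slice_name, teacher_answer, student_answer in records:
        counts[slice_name][0] += int(teacher_answer == student_answer)
        counts[slice_name][1] += 1
    return {name: matches / total for name, (matches, total) in counts.items()}

# Illustrative records only: a well-covered slice and a poorly covered one.
records = [
    ("common", "A", "A"),
    ("common", "B", "B"),
    ("common", "A", "A"),
    ("common", "C", "C"),
    ("rare", "A", "B"),
    ("rare", "C", "C"),
]

rates = agreement_by_slice(records)
print(rates)
```

Agreement measures fidelity to the teacher, not correctness. A student that faithfully copies a teacher's bias would still score perfectly here, so checking the third trade-off requires comparing both models against ground truth.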

Knowing these limits is what separates distillation as a reliable engineering tool from distillation as a hopeful shortcut.

The takeaway

Knowledge distillation trains a small student to imitate a large teacher. The key insight is that the most valuable thing to copy is not the teacher's final answer but its full distribution of confidence: the soft targets that reveal how the teacher spreads its uncertainty across the alternatives. That richer signal lets a compact model carry much of a large model's competence at a fraction of the cost. The student rarely exceeds its teacher on the distilled task, and it inherits its teacher's flaws, but as a way to turn expensive capability into deployable capability, distillation is one of the most quietly important techniques in modern machine learning.

