Synthetic data: training models on model output

When real data runs short, models can generate their own training data. The technique is powerful, slightly circular, and dangerous if you forget where the data came from.

Research · 2026-04-22 11:19 KST · Lead Editor · 7 min read

Machine learning has always been hungry for data, and for a while the supply seemed endless. But high-quality data for a specific task — labeled examples, clean instructions, careful demonstrations — is expensive, scarce, and sometimes legally or ethically off-limits. So the field has leaned into an idea that sounds almost paradoxical: let models generate the data used to train models. This is synthetic data, and it has quietly become one of the most important ingredients in how modern AI is built.

The premise feels circular, and in a sense it is. The art lies in making the circularity productive rather than degenerate — in getting more out of a model's output than you put into producing it, without slowly poisoning the well.

Why generate data at all

Real data has real limitations, and each one is a reason to consider synthetic data.

Some data barely exists. Rare events, unusual edge cases, low-resource languages, and uncommon scenarios are exactly the situations you most want a model to handle — and exactly the ones with the fewest natural examples.
Labeling is expensive. Even when raw data exists, turning it into the labeled, instructive form a model can learn from takes human effort that does not scale cheaply.
Real data carries constraints. It can contain private information, fall under usage restrictions, or be impossible to share. Synthetic data can be designed to sidestep those problems.
You can target exactly what you need. Instead of hoping the right examples appear in a corpus, you can ask a model to produce many examples of precisely the skill or situation you want to teach.

That last point is the deepest motivation. Synthetic data lets you manufacture the curriculum. If a model is weak at a certain kind of reasoning, you can generate a flood of focused practice problems for it, rather than scouring the world for naturally occurring ones.

The forms synthetic data takes

"Synthetic data" covers a range of techniques that differ in how much they lean on a model.

The lightest touch is augmentation: take real examples and transform them to create variations — rephrasing a sentence, altering an image slightly — so a small dataset stretches further. The data is mostly real, just multiplied.
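To make augmentation concrete, here is a minimal sketch of one text variant of it: synonym substitution. The synonym table, the function names, and the swap probability are all illustrative assumptions, not anything a particular library prescribes; a real pipeline would more likely use a curated lexical resource or a paraphrasing model.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "error": ["fault", "failure"],
    "fix": ["repair", "resolve"],
}

def augment(sentence: str, rng: random.Random, swap_prob: float = 0.5) -> str:
    """Return a variant of `sentence` with some words swapped for synonyms.

    Splits on whitespace only, so punctuation and casing are not handled.
    """
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def expand(dataset: list[str], copies: int = 3, seed: int = 0) -> list[str]:
    """Keep every real example and add up to `copies` distinct variants of each."""
    rng = random.Random(seed)
    expanded = []
    for sentence in dataset:
        variants = {augment(sentence, rng) for _ in range(copies)}
        variants.discard(sentence)  # never duplicate the original
        expanded.append(sentence)   # the real example always stays in
        expanded.extend(sorted(variants))
    return expanded

if __name__ == "__main__":
    for line in expand(["a quick fix for the error"]):
        print(line)
```

Note that every original sentence survives untouched; the variants only sit alongside it, which is what keeps the data "mostly real, just multiplied."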

A heavier approach is full generation: ask a capable model to produce examples from scratch, such as questions with answers, instructions with ideal responses, or worked solutions to problems. Here the model is the source of the data, not just a transformer of it.

A particularly effective pattern uses a strong model to teach a weaker or smaller one. The strong model generates high-quality demonstrations, and those become training data for the student. This overlaps heavily with distillation, and it is one of the main reasons capable behavior can be packed into smaller, cheaper models. The expensive model does the hard thinking once; its output becomes a reusable teaching corpus.

A subtler pattern uses a model to generate and then filter its own output: produce many candidate answers, keep only the good ones by some check, and train on the survivors. The model bootstraps itself by learning from its own best work while discarding the rest.

Why it works at all

It is fair to be suspicious. If a model knows only what it learned, how can its output teach it anything new? The resolution is that generation and learning are not the same operation, and several real mechanisms make the loop productive.

A model can often recognize a good answer more reliably than it can produce one on the first try. By generating many attempts and keeping only the ones that pass a check (a test that runs, a verifier that confirms, a reward score that clears a threshold), you distill scattered competence into clean, consistent training data. The model knew how to be right sometimes; filtering turns "sometimes" into "reliably."
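The generate-then-filter loop is easy to sketch with a toy stand-in for the model. In the example below, `noisy_solver` plays the part of a generator that is right only some of the time, and `verify` plays the part of the check. Both are hypothetical placeholders: in practice the generator would be a model call and the check would be a unit test, a formal verifier, or a reward threshold.

```python
import random

def noisy_solver(a: int, b: int, rng: random.Random) -> int:
    """Stand-in for a model: returns a + b, but is off by one about 40% of the time."""
    answer = a + b
    if rng.random() < 0.4:
        answer += rng.choice([-1, 1])
    return answer

def verify(a: int, b: int, answer: int) -> bool:
    """Stand-in for an external check. Here the ground truth is cheap to compute."""
    return answer == a + b

def build_filtered_dataset(problems, attempts: int = 8, seed: int = 0):
    """Sample up to `attempts` candidates per problem and keep the first verified one."""
    rng = random.Random(seed)
    kept = []
    for a, b in problems:
        for _ in range(attempts):
            candidate = noisy_solver(a, b, rng)
            if verify(a, b, candidate):
                kept.append({"prompt": f"{a} + {b} = ?", "answer": candidate})
                break  # one verified example per problem is enough for this sketch
        # Problems with no verified candidate are dropped, not guessed at.
    return kept

if __name__ == "__main__":
    for row in build_filtered_dataset([(1, 2), (10, 5), (7, 7)]):
        print(row)
```

The generator never gets more accurate inside this loop. What changes is the dataset: every row that survives has passed the check, so a model trained on it sees only the "right sometimes" attempts.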

Generation can also restructure existing knowledge into a more learnable form: turning raw text into clean question-and-answer pairs, or a terse solution into a step-by-step explanation. The information was latent; synthetic generation makes it explicit and easy to learn from. And one strong model can transfer its competence to many smaller ones, spreading capability that was expensive to create.

The danger: model collapse

The optimistic story has a sharp limit, and ignoring it is how synthetic data goes wrong. If you train a model purely on the output of models, generation after generation, without grounding in real data, quality can degrade in a process often called model collapse.

The intuition is that a model's output is a lossy reflection of reality. Train on that output and you learn the reflection, not the original. The rare cases and the tails of the distribution — the unusual, the surprising, the hard — are exactly what a model under-represents in its output, so they fade a little with each generation. Repeat the loop and the model's world narrows toward the bland, common middle, losing the diversity that made it capable. Like a photocopy of a photocopy, each pass loses detail that can never be recovered from within the loop.
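A toy simulation shows the tail-loss mechanism in isolation. The sketch below assumes a "model" that is nothing more than a frequency table over item types, starting from a long-tailed (Zipf-like) distribution; each generation draws a finite sample from the current table and re-estimates the table from that sample alone. The sizes and names are arbitrary choices for illustration, and real language models are far more complex, but the one-way loss is the same: a type that draws zero samples gets probability zero and can never be sampled again.

```python
import random
from collections import Counter

def zipf_distribution(num_types: int) -> dict[int, float]:
    """Long-tailed starting distribution: the type at rank r has weight 1/r."""
    weights = [1.0 / rank for rank in range(1, num_types + 1)]
    total = sum(weights)
    return {i: w / total for i, w in enumerate(weights)}

def next_generation(dist: dict[int, float], sample_size: int,
                    rng: random.Random) -> dict[int, float]:
    """'Train' on model output: sample from `dist`, then refit by raw frequency."""
    types = list(dist)
    probs = [dist[t] for t in types]
    sample = rng.choices(types, weights=probs, k=sample_size)
    counts = Counter(sample)
    # Types that were never sampled are absent from the new distribution.
    return {t: c / sample_size for t, c in counts.items()}

def simulate(num_types: int = 200, sample_size: int = 500,
             generations: int = 10, seed: int = 0) -> list[int]:
    """Return the number of surviving types after each generation."""
    rng = random.Random(seed)
    dist = zipf_distribution(num_types)
    support = [len(dist)]
    for _ in range(generations):
        dist = next_generation(dist, sample_size, rng)
        support.append(len(dist))
    return support

if __name__ == "__main__":
    print(simulate())
```

The printed list can only shrink or hold steady, never grow, because nothing inside the loop can reintroduce a lost type. Mixing fresh real samples into each generation is the only way to bring one back.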

This is the central cautionary tale of synthetic data. The output of a model is not a substitute for contact with reality; it is a derivative of it. Cut the connection to real, diverse, human-grounded data entirely and you risk slowly draining the system of exactly what made it good.

Using synthetic data without poisoning the well

The practitioners who use synthetic data well treat it as a supplement, not a replacement, and they keep a tether to reality.

Mix in real data. Keep genuine, diverse data in the training mix so the model stays anchored and the tails do not vanish.
Filter aggressively. Synthetic data is only as good as its quality control. Generating a lot and keeping the verifiably good fraction is where much of the value lives.
Ground generation in something real. Have the generator work from real documents, real constraints, or a checkable signal, rather than spinning text out of nothing.
Watch for narrowing. Monitor diversity, not just average quality. A dataset that looks clean but has lost its variety is a warning sign of the collapse dynamic taking hold.
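Two of these habits, mixing in real data and watching for narrowing, can be sketched in a few lines. The `real_fraction` parameter below is a hypothetical knob, not a recommended value, and `distinct_n` is one crude diversity signal among many (the share of n-grams that are unique); treat both as illustrative assumptions rather than a prescribed recipe.

```python
import random

def mix(real: list[str], synthetic: list[str], real_fraction: float,
        size: int, seed: int = 0) -> list[str]:
    """Sample a training set of `size` items with a fixed share of real data.

    Samples with replacement, so either pool may be smaller than its quota.
    """
    rng = random.Random(seed)
    n_real = round(size * real_fraction)
    n_synthetic = size - n_real
    batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_synthetic)
    rng.shuffle(batch)
    return batch

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of n-grams that are unique across `texts`; lower means more repetition."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

if __name__ == "__main__":
    real = ["the cat sat on the mat", "a dog ran across the yard", "birds sing at dawn"]
    synthetic = ["the model is good", "the model is good", "the model is fine"]
    print("real diversity:     ", distinct_n(real))
    print("synthetic diversity:", distinct_n(synthetic))
    print(mix(real, synthetic, real_fraction=0.5, size=6))
```

Tracking a number like this across dataset versions, alongside average quality, gives an early signal when a clean-looking corpus is quietly losing variety.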

Done this way, synthetic data is an amplifier of real data rather than a replacement for it — and the difference between those two framings is the difference between a powerful technique and a slow failure.

The takeaway

Synthetic data is the practice of using models to generate the data that trains models, and it has become essential because real, labeled, high-quality data is scarce, expensive, and constrained. It works because recognizing, filtering, and restructuring can extract more reliable knowledge than raw generation alone, and because a strong model's output can teach many smaller ones. But it carries a real hazard: cut off from real data and looped on itself, a model trained on model output drifts toward blandness, the failure known as model collapse. The discipline is to keep synthetic data tethered to reality (mixed with real examples, filtered hard, and grounded in something checkable) so it amplifies what you have instead of slowly eroding it.

