How large language models are trained, in plain language

Training a language model happens in stages, not one magic step. Here is what each stage does, in plain language, and why the order matters.

models · 2026-06-01 12:06 KST · Lead Editor · 7 min read

A large language model can feel like a single, finished object — you type something, it answers. But the thing answering you was built in stages, and each stage does a different job. If you only ever picture "the AI learned from the internet," you miss why models behave the way they do: why they sometimes sound confident and wrong, why they follow instructions at all, and why two models trained on similar data can feel so different. This piece walks through the main stages in plain language, in the order they happen.

The core idea: predict the next piece of text

Strip away the jargon and a language model does one thing: it predicts what comes next. Given a stretch of text, it estimates how likely each possible next piece is (a piece, usually called a token, is roughly a word or a fragment of one), picks one according to those likelihoods, appends it, and repeats. That is the whole mechanical job.
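To make that loop concrete, here is a minimal sketch in Python. It is a toy, not how a real model works internally: it assumes whole words as the pieces and estimates likelihoods by counting which word followed which in a tiny made-up corpus, where a real model uses a neural network over far more context. The function names and the corpus are invented for illustration.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the raw counts for one context into probabilities."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def generate(counts, start, length, rng):
    """Repeat: estimate probabilities, sample one token, append it."""
    out = [start]
    for _ in range(length):
        probs = next_token_probs(counts, out[-1])
        if not probs:  # this token was never followed by anything in the corpus
            break
        tokens, weights = zip(*probs.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out

# Toy corpus, invented for this example.
corpus = "the cat sat on the mat and the cat ran".split()
counts = build_bigram_counts(corpus)
print(next_token_probs(counts, "the"))
print(" ".join(generate(counts, "the", 6, random.Random(0))))
```

Passing a seeded random.Random makes a run repeatable; a different seed can give a different continuation from the same starting word.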

What makes this powerful is that predicting the next piece well turns out to require a lot of implicit knowledge. To guess the word that finishes "the capital of that country is," the model has to have absorbed something about geography. To continue a snippet of code correctly, it has to have absorbed something about syntax. Nobody programs these facts in directly. They are a side effect of getting very good at prediction over a huge amount of text. Keep this in mind: everything a model "knows" is knowledge it picked up in service of prediction, not facts it was handed as truths.

Stage one: pretraining

The first and largest stage is pretraining. The model is shown an enormous quantity of text and repeatedly asked to predict the next piece. After each guess, its internal settings (its parameters, or weights) are nudged slightly so that the piece that actually came next would be rated more likely. Over billions of these tiny corrections, it builds a statistical sense of how language works and what tends to follow what.
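One of those tiny corrections can be sketched in a few lines. The sketch assumes a drastically simplified "model" that is just a list of scores, one per possible next piece, and uses the standard softmax and cross-entropy recipe; real models apply the same kind of update to billions of settings at once. The function names, the three-item vocabulary, and the learning rate are illustrative assumptions.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(scores, target, learning_rate):
    """Nudge the scores so the actual next piece becomes a bit more likely."""
    probs = softmax(scores)
    loss = -math.log(probs[target])  # large when the true next piece looked unlikely
    # Gradient of the loss for each score: (probability - 1) for the target,
    # plain probability for every other candidate.
    new_scores = [
        s - learning_rate * (p - (1.0 if i == target else 0.0))
        for i, (s, p) in enumerate(zip(scores, probs))
    ]
    return new_scores, loss

scores = [0.0, 0.0, 0.0]  # no preference yet among three candidate pieces
losses = []
for _ in range(20):
    scores, loss = train_step(scores, target=2, learning_rate=0.5)
    losses.append(loss)
print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
```

Each step changes the scores only slightly, but repeated over many steps the true next piece ends up with most of the probability.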

A few things are worth understanding about this stage:

It is self-supervised. Nobody hand-labels the data. The "right answer" for each prediction is simply the actual next piece of text, which is already there. This is why it can scale: the supervision is free.
It is broad, not curated for behavior. Pretraining data is a wide sweep of text. The model learns the patterns in that text — helpful ones and unhelpful ones alike. It has no sense yet of being an "assistant."
It is by far the most expensive stage. The heavy compute cost people associate with training models lives mostly here.
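The self-supervised point is easy to see in code. In this minimal sketch, which assumes whitespace-separated words as the pieces and a fixed-size context window (both simplifications), every training example is cut straight out of the raw text and the "label" is just whatever came next. The function name is hypothetical.

```python
def make_training_pairs(tokens, context_size):
    """Slide over the text; the target is simply the token that follows each context."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        target = tokens[i]  # already present in the data, so no human labeling
        pairs.append((context, target))
    return pairs

text = "the model predicts the next piece"
for context, target in make_training_pairs(text.split(), context_size=2):
    print(context, "->", target)
```

Because the targets come for free, adding more text automatically adds more training examples.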

After pretraining you have a model that is fluent and knowledgeable but not especially useful to talk to. It will happily continue your text in whatever direction the patterns suggest. Ask it a question and it may ignore what you asked and instead imitate the style of a question-and-answer page, for instance by adding more questions. It is raw capability without manners.

Stage two: teaching it to follow instructions

The next stage closes the gap between "can continue text" and "does what I asked." This is often called instruction tuning or supervised fine-tuning. The model is shown many examples of the form "here is a request, here is a good response," and it learns to produce responses in that shape.

This is a smaller, more deliberate stage than pretraining. The examples are written or curated to demonstrate the behavior you want: answering directly, following formatting requests, declining things it should decline, admitting uncertainty. The model already has the underlying capability from pretraining; this stage points that capability at the job of being a helpful assistant.
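A single training example for this stage might be assembled roughly as follows. This is a sketch under assumptions: the `<user>`, `<assistant>`, and `<end>` markers are made-up placeholders rather than any real model's format, words stand in for tokens, and scoring only the response part is one common choice, not a universal rule.

```python
def build_sft_example(request, response):
    """Join a request and a good response into one sequence plus a loss mask."""
    prompt_tokens = ["<user>"] + request.split() + ["<assistant>"]
    response_tokens = response.split() + ["<end>"]
    tokens = prompt_tokens + response_tokens
    # 0 = context only; 1 = the model is trained to predict this token.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_average_loss(per_token_losses, loss_mask):
    """Average the prediction loss over response tokens only."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

tokens, loss_mask = build_sft_example("What is 2+2 ?", "4")
for token, keep in zip(tokens, loss_mask):
    print(f"{token:12s} {'train' if keep else 'context'}")
```

The prediction task is the same as in pretraining; what changes is the data, which now consistently shows a request followed by the kind of response that is wanted.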

The important mental shift here is that instruction following is trained, not innate. A model is not naturally inclined to answer your question rather than mimic the genre of your question. It does so because it was shown, repeatedly, that this is the expected behavior.

Stage three: learning from preferences

Demonstrations only go so far. For many requests there is no single correct answer — there are better and worse ones. To capture that, models go through a stage that learns from preferences: humans (and increasingly other models acting as graders) compare two responses and indicate which is better. The model is then adjusted to produce more of what is preferred and less of what is not.

The best-known version of this is reinforcement learning from human feedback (RLHF), though several variations exist. The mechanics differ, but the goal is the same: shape the model's tendencies toward responses people actually find helpful, honest, and appropriate, rather than just plausible.
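One common way to turn "A is better than B" into a training signal is a pairwise loss over scores, often described as Bradley-Terry style. The sketch below assumes each response has already been given a single numeric score by some scoring model; the function names are illustrative, and this is one formulation among the variations mentioned above, not the method any particular lab uses.

```python
import math

def preference_probability(score_chosen, score_rejected):
    """Modelled chance that the chosen response is the preferred one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def preference_loss(score_chosen, score_rejected):
    """Negative log of that chance: small when the preferred response scores higher."""
    return -math.log(preference_probability(score_chosen, score_rejected))

print(preference_loss(2.0, 0.0))  # scores agree with the human choice
print(preference_loss(0.0, 2.0))  # scores disagree, so the loss is larger
```

Minimizing this loss over many comparisons pushes scores for preferred responses up and scores for rejected ones down, and those scores can then steer how the model itself is adjusted.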

This stage explains a lot of a model's "personality." Whether it hedges or commits, how it handles sensitive requests, how verbose it is by default — much of that is the residue of preference training, not raw knowledge. It is also where a lot of the safety behavior is instilled.

Why models still get things wrong

Understanding the stages makes the failure modes less mysterious.

A model hallucinates — states something false with confidence — partly because its core skill is producing plausible continuations, and a fluent wrong answer can be more plausible-sounding than an honest "I don't know." Training pushes against this, but it cannot fully remove a tendency baked into the objective itself.

A model has a knowledge cutoff because pretraining used data gathered up to some point; events after that simply were not in the text it learned from.

A model can be inconsistent because it is sampling from a distribution of likely continuations, not reading from a fixed database. Ask the same thing twice and the path through that distribution can differ.
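The sampling behavior can be sketched directly. The example assumes three candidate pieces with made-up scores and a "temperature" setting, a common knob that sharpens or flattens the distribution before sampling; the function name and the numbers are illustrative.

```python
import math
import random

def sample_with_temperature(scores, temperature, rng):
    """Pick an index at random, weighted by softmax(scores / temperature)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

scores = [2.0, 1.0, 0.5]  # invented scores for three candidate pieces
rng = random.Random(42)
varied = [sample_with_temperature(scores, 1.0, rng) for _ in range(200)]
sharp = [sample_with_temperature(scores, 0.01, rng) for _ in range(200)]
print("distinct picks at temperature 1.0:", sorted(set(varied)))
print("distinct picks at temperature 0.01:", sorted(set(sharp)))
```

At a very low temperature the top-scoring piece wins essentially every time; at higher temperatures the same question can take different paths on different runs.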

None of these are bugs in the ordinary sense. They follow directly from how the thing is built.

Where evaluation and iteration fit

Training is not a straight line from start to finished product. Between and after these stages, models are evaluated — tested on tasks, probed for unsafe behavior, checked for regressions — and the results feed back into more tuning. A real model is the output of many rounds of train, measure, adjust, repeat. The clean three-stage story above is the backbone; in practice there is a great deal of iteration layered on top, much of it aimed at fixing specific weaknesses found during testing.
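The "checked for regressions" step can be as simple as comparing scores between two versions of a model. A minimal sketch, assuming each evaluation produces one score per task where higher is better; the task names, scores, and tolerance below are invented for illustration and are not real results.

```python
def find_regressions(old_scores, new_scores, tolerance=0.01):
    """Return tasks whose score dropped by more than `tolerance` between versions."""
    return sorted(
        task
        for task, old in old_scores.items()
        if task in new_scores and old - new_scores[task] > tolerance
    )

# Hypothetical scores for two checkpoints of the same model.
old_scores = {"arithmetic": 0.80, "refusals": 0.95, "summaries": 0.70}
new_scores = {"arithmetic": 0.85, "refusals": 0.90, "summaries": 0.695}
print(find_regressions(old_scores, new_scores))
```

A flagged task becomes a target for the next round of tuning, which is the train, measure, adjust loop in miniature.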

The takeaway

A language model is not trained in one step and it is not simply "the internet compressed." It is pretrained to predict text and absorb broad knowledge, tuned to follow instructions, and shaped by preferences toward being helpful and safe — with evaluation and iteration threaded throughout. Each stage explains something you can see in the finished product: pretraining gives it knowledge and fluency, instruction tuning gives it the habit of answering you, and preference training gives it its manners and judgment. When a model surprises you — confidently wrong, oddly cautious, stuck before recent events — you can usually trace the behavior back to one of these stages. That mental model will serve you far better than imagining a single mysterious moment where the machine "learned."

