Temperature, top-p, and sampling: controlling model output

Temperature and top-p decide how a model picks its next word. Knowing what each one really does lets you dial output from rigid to creative on purpose.

models · 2026-04-06 09:43 KST · Lead Editor · 7 min read

If you have ever asked a model the same question twice and gotten two different answers, you have met sampling. A language model does not deterministically output "the" right next word; at each step it produces a spread of possibilities with different likelihoods, and then something has to choose among them. The settings that govern that choice — most commonly temperature and top-p — are among the few knobs you directly control, and they have an outsized effect on whether output feels rigid, balanced, or wildly creative. Understanding them turns a frustrating "why is it being random?" into a deliberate dial you can set on purpose.

Where the randomness comes from

At every step of generating text, the model computes a probability for each possible next token. It might decide the next token is "blue" with high probability, "green" with lower probability, "purple" with lower still, and so on across its whole vocabulary. This spread is a distribution: a ranked set of candidates with attached likelihoods.
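As a rough illustration of what such a distribution looks like, here is a minimal sketch in Python. It assumes the model produces a raw score (a "logit") per token and converts those scores to probabilities with a softmax, which is the usual arrangement; the four-token vocabulary and the scores themselves are made up for the example.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token that follows "The sky is"
logits = {"blue": 4.0, "green": 2.0, "purple": 1.0, "loud": -1.0}
probs = softmax(logits)

# Print candidates from most to least likely
for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:>7}: {p:.3f}")
```

A real vocabulary has tens of thousands of entries rather than four, but the shape is the same: a few strong candidates and a long tail.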

The model does not, by itself, decide which one to use. That is the job of the sampling step. The simplest possible rule would be "always take the single most likely token." That rule is called greedy decoding, and it sounds appealing (always pick the best guess), but in practice it tends to produce flat, repetitive, sometimes oddly stuck text. Good writing usually involves some variation, and text that always takes the most likely word rarely reads well. So instead of always grabbing the top candidate, models typically sample from the distribution, and temperature and top-p shape how that sampling behaves.
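The difference between greedy decoding and sampling can be sketched in a few lines. The probabilities below are invented for illustration, and the function names greedy and sample are just labels for this example, not any library's API.

```python
import random

# Hypothetical next-token distribution (values chosen for the example)
probs = {"blue": 0.80, "green": 0.12, "purple": 0.06, "loud": 0.02}

def greedy(probs):
    """Greedy decoding: always return the single most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Sampling: draw one token, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demo is repeatable
greedy_picks = [greedy(probs) for _ in range(10)]
sampled_picks = [sample(probs, rng) for _ in range(10)]

print("greedy: ", greedy_picks)
print("sampled:", sampled_picks)
```

The greedy list is ten copies of the same token. The sampled list mostly agrees with it but is free to pick a less likely candidate now and then, which is where the variation in generated text comes from.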

Temperature: flattening or sharpening the odds

Temperature controls how much the model favors its high-probability candidates over its low-probability ones. The cleanest way to picture it: temperature reshapes the distribution before a token is drawn.

Low temperature sharpens the distribution. The already-likely tokens become even more dominant, and the long tail of unlikely options gets squeezed toward irrelevance. Output becomes more focused, more predictable, more repetitive. At the extreme, very low temperature approaches greedy behavior — it almost always takes the top candidate.
High temperature flattens the distribution. The gap between likely and unlikely tokens narrows, so less probable, more surprising tokens get a real chance of being chosen. Output becomes more varied, more creative, and — past a point — less coherent, because the model is now willing to pick tokens it considered unlikely.

A useful intuition: temperature does not give the model new ideas. It only changes how willing the model is to reach past its safest guess. Low temperature is a cautious writer who always picks the obvious word; high temperature is one who reaches for the unexpected, sometimes brilliantly and sometimes nonsensically.
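One common way to implement this, sketched below under the assumption that the model exposes raw scores (logits), is to divide every score by the temperature before converting to probabilities. The scores and the function name apply_temperature are hypothetical; real providers may scale or clamp the setting differently.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax into probabilities."""
    if temperature <= 0:
        raise ValueError("this sketch needs a temperature above zero")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for four candidate tokens
logits = {"blue": 4.0, "green": 2.0, "purple": 1.0, "loud": -1.0}

cold = apply_temperature(logits, 0.5)     # sharper: top token dominates more
neutral = apply_temperature(logits, 1.0)  # the unmodified distribution
hot = apply_temperature(logits, 2.0)      # flatter: the tail gains weight

for name, dist in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, {tok: round(p, 3) for tok, p in dist.items()})
```

Note that the ranking of candidates is identical at every temperature. Only the gaps between them change, which is the sense in which temperature adds no new ideas.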

Top-p: trimming the tail before choosing

Top-p, also called nucleus sampling, works differently. Instead of reshaping all the probabilities, it restricts which candidates are eligible in the first place.

The idea: line up the candidate tokens from most to least likely, and keep adding them to a shortlist until their combined probability reaches the threshold p. Everything outside that shortlist is discarded for this step, and the model samples only from the survivors. With p set to 0.9, for example, the shortlist is the smallest group of top-ranked tokens whose probabilities add up to at least 0.9. A high top-p keeps a broad shortlist; a lower value keeps only the few most probable tokens.

The clever part is that this shortlist resizes itself automatically. When the model is confident — one or two tokens carry most of the probability — the shortlist is tiny, and output stays on-rails. When the model is genuinely uncertain and probability is spread across many plausible tokens, the shortlist grows, allowing variety exactly where variety is reasonable. Top-p is, in effect, a dynamic way to cut off the implausible tail without forcing a fixed number of choices.
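The self-resizing shortlist is easy to see in code. The sketch below is a minimal version of the filter, with two invented distributions standing in for a confident model and an uncertain one; top_p_filter is a name chosen for this example.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to >= p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break  # threshold reached: everything after this is discarded
    total = sum(prob for _, prob in kept)
    # Renormalize so the survivors form a proper distribution again
    return {tok: prob / total for tok, prob in kept}

# Hypothetical distributions: one confident step, one uncertain step
confident = {"blue": 0.92, "green": 0.05, "purple": 0.02, "loud": 0.01}
uncertain = {"blue": 0.30, "green": 0.28, "purple": 0.22, "loud": 0.20}

print(top_p_filter(confident, 0.9))
print(top_p_filter(uncertain, 0.9))
```

With the same p of 0.9, the confident step keeps a single token while the uncertain step keeps all four. A fixed-size cutoff could not adapt in this way.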

How the two relate

Temperature and top-p are often available together, and they answer two different questions:

Temperature asks: how much should I favor my confident guesses over my unsure ones?
Top-p asks: how much of the unlikely tail should I even consider?

They can be combined, but combining them aggressively can be hard to reason about, because both are loosening or tightening the same output in overlapping ways. A common, sane approach is to adjust one as your primary creativity dial and leave the other at a moderate default, rather than pushing both to extremes at once. Exact numeric ranges differ between model providers, so treat the behavior — sharper versus flatter, narrower versus broader — as the thing you are tuning, and check each provider's own documentation for the specific scale.
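To make the interaction concrete, here is a sketch that applies both settings in one sampling step. It assumes temperature is applied first and top-p second, which is a common ordering but not a universal one; the function name sample_token, its defaults, and the scores are all illustrative rather than any provider's actual API.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """One sampling step: temperature, then top-p, then a weighted draw."""
    # 1. Temperature reshapes the whole distribution.
    scaled = {tok: s / temperature for tok, s in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # 2. Top-p trims the tail of the reshaped distribution.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 3. Draw from the survivors (choices() accepts unnormalized weights).
    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for four candidate tokens
logits = {"blue": 4.0, "green": 2.0, "purple": 1.0, "loud": -1.0}
rng = random.Random(0)

tight = [sample_token(logits, temperature=1.0, top_p=0.5, rng=rng) for _ in range(10)]
loose = [sample_token(logits, temperature=2.0, top_p=1.0, rng=rng) for _ in range(10)]
print("tight:", tight)
print("loose:", loose)
```

Because the temperature step changes the probabilities that the top-p step then measures, raising one setting alters what the other does. That overlap is why changing a single dial at a time makes results easier to interpret.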

Matching settings to the task

The right setting depends entirely on what you are doing.

When you want consistency and correctness — extracting structured data, answering factual questions, classifying text, generating code that must run — bias toward low randomness. You want the model's most confident, on-distribution answer, and you want it to be reproducible. High randomness here just invites avoidable mistakes and makes failures harder to debug.

When you want variety and creativity — brainstorming, drafting marketing copy, generating multiple distinct options, fiction — raise the randomness. The occasional odd choice is a feature; you are mining the model for range, and several different attempts are exactly the point.

A practical pattern for idea generation is to deliberately run the same prompt several times at higher randomness and pick the best result, rather than expecting one perfect output. For anything you need to be stable and testable, do the opposite: minimize randomness so the same input reliably gives the same output.
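The run-several-times-and-pick pattern can be sketched as a small loop. Everything here is a stand-in: generate is a toy substitute for a real model call, score is an arbitrary quality measure invented for the demo, and in practice the "pick the best" step is often a human reading the candidates.

```python
import random

WORDS = ["bright", "bold", "fresh", "quiet", "swift", "warm"]

def generate(prompt, temperature, rng):
    """Hypothetical stand-in for a model call: returns a three-word tagline.

    Higher temperature lets the toy reach further into its word list.
    """
    reach = max(1, min(len(WORDS), round(temperature * len(WORDS))))
    return " ".join(rng.choice(WORDS[:reach]) for _ in range(3))

def score(candidate):
    """Hypothetical quality measure: count of distinct words."""
    return len(set(candidate.split()))

def best_of_n(prompt, n, temperature, rng):
    """Generate n candidates for one prompt and return the top scorer."""
    candidates = [generate(prompt, temperature, rng) for _ in range(n)]
    return max(candidates, key=score), candidates

rng = random.Random(1)
best, candidates = best_of_n("Write a tagline", n=5, temperature=1.0, rng=rng)
print("candidates:", candidates)
print("best:      ", best)
```

The design point is that variety is produced by the sampling settings and quality is enforced afterwards by selection, so no single run has to be perfect.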

A note on reproducibility

If you need the same output every time (for testing, for caching, for auditability), high randomness works against you. Lowering temperature toward its floor pushes behavior toward determinism, and some interfaces offer additional controls aimed at reproducibility. But be realistic: perfectly identical output across runs is not always guaranteed, so verify rather than assume. The general principle holds regardless: less randomness means more repeatable, more conservative output; more randomness means more varied, less predictable output.
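In a local toy setting, both levers can be demonstrated directly. The sketch below assumes a seeded random number generator as the "additional control" and uses invented scores; it says nothing about how any hosted model behaves, where hardware and serving details can still introduce differences between runs.

```python
import math
import random

# Hypothetical scores for four candidate tokens
LOGITS = {"blue": 4.0, "green": 2.0, "purple": 1.0, "loud": -1.0}

def draw(logits, temperature, rng):
    """Sample one token after temperature scaling."""
    scaled = {tok: s / temperature for tok, s in logits.items()}
    peak = max(scaled.values())
    weights = [math.exp(s - peak) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

def run(seed, temperature, n=20):
    """Generate n tokens from a freshly seeded generator."""
    rng = random.Random(seed)
    return [draw(LOGITS, temperature, rng) for _ in range(n)]

# Same seed and same settings: the two runs match in this toy setup.
print(run(42, 1.0) == run(42, 1.0))

# Very low temperature: the top token is chosen on every draw here.
print(set(run(7, 0.05)))
```

A check like the first print statement is the kind of verification worth running against a real interface before relying on repeatable output.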

The takeaway

Sampling is the step where a model turns its internal spread of possible next tokens into an actual choice, and temperature and top-p are how you steer it. Temperature sharpens or flattens the whole distribution — how boldly the model reaches past its safest guess. Top-p trims the unlikely tail before choosing, widening the options only when the model is genuinely uncertain. Neither adds knowledge; both shape expression. Reach for low randomness when you need correctness and consistency, higher randomness when you want range and surprise, and adjust one dial at a time so you can actually tell what changed. Used deliberately, these settings turn unpredictable output into a tool you control.

#sampling #temperature #top-p #inference
