Reasoning models: what "thinking" tokens do
"Reasoning models work through a problem before answering. That hidden working costs time and tokens — and pays off only on the right kind of task."
A newer family of models is often described as "reasoning" or "thinking" models. The name is useful, but it can also mislead: these models do not think the way a person does. What they do is spend extra generation working through a problem, step by step, before committing to a final answer. That intermediate working is sometimes called "thinking tokens," and it is the defining feature of the category. It can dramatically improve answers on some problems and add cost and latency for no benefit on others. Knowing the difference is what separates using these models well from overpaying for them.
This piece explains what the thinking step actually is, why it helps, what it costs, and how to decide when a reasoning model is the right tool rather than just the more expensive one.
The difference from a standard model
A standard model, given a question, begins producing its answer immediately, generating the response token by token from the first word. A reasoning model inserts a phase in between. Before it writes the answer you see, it generates a stretch of intermediate text — laying out the problem, considering steps, working things through. Only after that working does it produce the final response.
That intermediate text is the "thinking." Often it is hidden from the user and only the final answer is shown, but it is still generated, which means it still takes time and still costs tokens. The mental model to hold is simple: a standard model answers; a reasoning model works first, then answers. Everything distinctive about the category — its strengths, its costs, its right uses — follows from that one extra phase.
Why working through a problem helps
The reason this extra step improves answers comes back to how generation works. A model produces each token based on everything before it, so the text already on the page shapes what comes next. When a model jumps straight to an answer on a hard, multi-step problem, it is committing to a conclusion before it has laid down the intermediate steps that would support it — and once an early token goes wrong, everything after it builds on the mistake.
By generating its working first, a reasoning model gives itself those intermediate steps to build on. Each step becomes context for the next, so a complex problem gets decomposed into a chain of smaller moves rather than attempted in one leap. This is why the gains show up most on problems that genuinely have multiple steps — math, logic, careful analysis, intricate code — where the answer depends on getting a sequence of sub-conclusions right. The working is not decoration; it is the scaffolding the final answer stands on.
What it costs
The thinking phase is not free. Because the thinking is itself generated text, its costs are the ordinary costs of generation, and two of them matter.
The first is latency. Generating the working takes time before the answer appears. A reasoning model is slower to respond than a standard one on the same question, sometimes substantially, because it is producing a whole stretch of text the user never asked to read. For anything interactive where speed matters, that delay is a real tax.
The second is token cost. The thinking tokens are generated output, and generated output is typically billed even when it is hidden from the user. A reasoning model can therefore cost considerably more per question than a standard model, because you pay for all the working in addition to the final answer. A short visible response can sit atop a large, paid-for body of hidden reasoning. Neither cost is a flaw; both are the price of the extra phase. But they pay off only when the phase actually improves the answer.
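To make the two costs concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the prices, the token counts, and the generation speed are invented numbers, and the names `Pricing`, `estimate_cost`, and `estimate_latency` are hypothetical helpers, not part of any provider's API. It also assumes thinking tokens are billed at the output rate and ignores prompt-processing time.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in dollars per million tokens (not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(pricing: Pricing, input_tokens: int,
                  thinking_tokens: int, answer_tokens: int) -> float:
    """Dollar cost of one request, assuming hidden thinking is billed as output."""
    billed_output = thinking_tokens + answer_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


def estimate_latency(thinking_tokens: int, answer_tokens: int,
                     tokens_per_second: float) -> float:
    """Seconds spent generating output, assuming a constant generation rate."""
    return (thinking_tokens + answer_tokens) / tokens_per_second


# Illustrative numbers only: same question, same visible answer length.
pricing = Pricing(input_per_million=1.0, output_per_million=10.0)
standard_cost = estimate_cost(pricing, input_tokens=500,
                              thinking_tokens=0, answer_tokens=200)
reasoning_cost = estimate_cost(pricing, input_tokens=500,
                               thinking_tokens=3000, answer_tokens=200)
standard_wait = estimate_latency(0, 200, tokens_per_second=50.0)
reasoning_wait = estimate_latency(3000, 200, tokens_per_second=50.0)

print(f"standard:  ${standard_cost:.4f}, about {standard_wait:.0f}s")
print(f"reasoning: ${reasoning_cost:.4f}, about {reasoning_wait:.0f}s")
```

With these made-up inputs the visible answer is identical in length, yet the request with hidden working costs and waits many times more. The real ratio depends entirely on how much working the model produces and on actual prices.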
When a reasoning model is worth it
The decision rule follows directly from the trade-off: use a reasoning model when the problem's difficulty justifies the extra time and tokens, and not otherwise. Some questions are genuinely hard and multi-step — a tricky logical deduction, a math problem, a complex piece of analysis, code that has to satisfy several interacting constraints. On these, the working materially improves correctness, and the added cost buys a better answer. This is where reasoning models shine.
Many questions are not like that. Pulling a fact from a document, rephrasing a sentence, classifying a short piece of text, answering something simple and direct — these do not have multiple steps to work through, so the thinking phase adds latency and cost while changing the answer little or not at all. Using a reasoning model here is overkill: you pay the premium and wait longer for an answer a standard model would have produced just as well, faster and cheaper. The waste is invisible until you look at the bill and the response times.
The thinking is not a window into truth
It is tempting to read a reasoning model's working as a transparent explanation of how it reached its answer — a justification you can trust. Be careful. The thinking text is itself generated output, produced by the same probabilistic process as everything else. It often does reflect a genuine working-through that helps the model, but it is not a guaranteed, faithful log of the model's internal computation, and it can contain steps that look reasonable yet are wrong. Treat the working as useful context and a debugging aid, not as proof. A confident chain of reasoning can still arrive at a confident mistake, and the presence of detailed working is not by itself evidence the answer is correct.
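One way to act on this is to check the final answer independently wherever the task allows it, instead of judging it by how convincing the working looks. The sketch below uses a deliberately simple, hypothetical case: a model claims a factorization, and the check multiplies the factors back out. The function name `verify_factorization` is an illustrative helper, and most real tasks will need a different, task-specific check.

```python
from typing import List


def verify_factorization(n: int, claimed_factors: List[int]) -> bool:
    """Check a claimed factorization by multiplying the factors back out.

    This tests only the final claim; it does not read or trust any reasoning text.
    """
    product = 1
    for factor in claimed_factors:
        if factor <= 1:
            # Reject trivial or invalid factors such as 1, 0, or negatives.
            return False
        product *= factor
    return product == n


# Two hypothetical model outputs for "factor 391", both with fluent working.
good_claim = verify_factorization(391, [17, 23])
bad_claim = verify_factorization(391, [19, 21])
print(good_claim, bad_claim)
```

The check passes or fails on the answer alone, which is the point: detailed working that ends in `19 * 21` is still wrong.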
How to choose in practice
The practical approach mirrors evaluating any model: test on your own task rather than assume. Take a representative set of the problems your application actually handles and compare a reasoning model against a standard one on exactly those inputs, watching three things at once — answer quality, latency, and token cost. If the reasoning model's quality gain on your problems is large enough to justify the slower, pricier responses, it earns its place. If the quality is similar, the standard model is the better choice and the reasoning premium is pure waste.
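A comparison like this can be sketched as a small harness. The version below is a minimal illustration under stated assumptions: each model is any callable that takes a prompt and returns a `ModelResult`, quality is scored by exact match (often too crude for real tasks), and the two stub models stand in for real API calls. All names here are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ModelResult:
    answer: str
    output_tokens: int  # billed output, including any hidden thinking


@dataclass
class Summary:
    accuracy: float
    mean_latency_s: float
    mean_output_tokens: float


def evaluate(model: Callable[[str], ModelResult],
             cases: List[Tuple[str, str]]) -> Summary:
    """Run a model over (prompt, expected) pairs and track all three measures."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = 0
    total_latency = 0.0
    total_tokens = 0
    for prompt, expected in cases:
        start = time.perf_counter()
        result = model(prompt)
        total_latency += time.perf_counter() - start
        correct += int(result.answer.strip() == expected)  # exact-match scoring
        total_tokens += result.output_tokens
    n = len(cases)
    return Summary(correct / n, total_latency / n, total_tokens / n)


# Stub models for illustration only; replace with calls to real models.
def standard_stub(prompt: str) -> ModelResult:
    answers = {"2+2": "4"}
    return ModelResult(answers.get(prompt, "unknown"), output_tokens=5)


def reasoning_stub(prompt: str) -> ModelResult:
    answers = {"2+2": "4", "17*23": "391"}
    return ModelResult(answers.get(prompt, "unknown"), output_tokens=400)


cases = [("2+2", "4"), ("17*23", "391")]
standard_summary = evaluate(standard_stub, cases)
reasoning_summary = evaluate(reasoning_stub, cases)
print(standard_summary)
print(reasoning_summary)
```

The stubs are rigged so the reasoning model wins on accuracy and loses on tokens, which is exactly the trade-off the harness is meant to surface on real inputs. Latency from stubs is meaningless; it only becomes informative once real model calls are plugged in.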
Often the best design is to route by difficulty: send the genuinely hard problems to a reasoning model and the routine ones to a standard model, so each question pays only for the working it needs. Reaching for the reasoning model by default, on every request, is the common and costly mistake — it spends time and tokens on simple questions that never needed them.
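Routing by difficulty can be sketched in a few lines. The difficulty check below is a crude placeholder, assumed purely for illustration: it flags a prompt as hard if it is long or contains one of a few hint words. A real system would more likely use a trained classifier, task metadata, or measured results from its own evaluation. The names `looks_hard` and `route` are hypothetical.

```python
from typing import Callable

# Placeholder hint words; an assumption for this sketch, not a tested list.
HARD_HINTS = ("prove", "derive", "debug", "optimize", "step by step")


def looks_hard(prompt: str, max_easy_words: int = 40) -> bool:
    """Very rough difficulty guess: long prompts or hint words count as hard."""
    text = prompt.lower()
    too_long = len(text.split()) > max_easy_words
    return too_long or any(hint in text for hint in HARD_HINTS)


def route(prompt: str,
          standard_model: Callable[[str], str],
          reasoning_model: Callable[[str], str],
          is_hard: Callable[[str], bool] = looks_hard) -> str:
    """Send hard prompts to the reasoning model and everything else to the standard one."""
    model = reasoning_model if is_hard(prompt) else standard_model
    return model(prompt)


# Stand-in models that only report which path was taken.
easy_path = route("Rephrase this sentence politely.",
                  lambda p: "standard", lambda p: "reasoning")
hard_path = route("Prove that the sum of two even numbers is even.",
                  lambda p: "standard", lambda p: "reasoning")
print(easy_path, hard_path)
```

Passing the difficulty check in as a parameter keeps the router independent of any one heuristic, so the placeholder can be swapped for something validated against the application's own problems.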
The takeaway
Reasoning models add a phase: they generate intermediate working before their final answer, and that working — the "thinking tokens" — is what makes them distinctive. It genuinely improves answers on hard, multi-step problems by giving the model scaffolding to build on, but it costs both latency and tokens, since the working is generated output you pay for even when it is hidden. Use these models where the difficulty earns the premium and a standard model where it does not, treat the visible reasoning as a helpful aid rather than guaranteed truth, and let a test on your own problems decide. Thinking is powerful precisely where thinking is required — and dead weight everywhere else.
