The cost of a token: how model pricing works
"Model bills are measured in tokens, not words or requests. Understanding what a token is, and which ones you pay for, is how you keep costs predictable."
When you use a model through an API, you are not billed per request or per word. You are billed per token. Tokens are the unit the model actually reads and writes, and almost every surprise on a model bill comes from not understanding what they are and which ones you pay for. The good news is that the pricing scheme is simple once you see its shape, and a handful of habits keep costs predictable. The bad news is that the tokens that drive your bill are often invisible in the text you see on screen.
This piece explains what a token is, why input and output are priced differently, where hidden tokens hide, and how to estimate and control what you will spend — all in principles that stay true regardless of which provider or model you use.
What a token actually is
A token is a chunk of text: roughly a word, but not exactly. A model's tokenizer does not split text into letters or whole words; it breaks text into pieces that sit somewhere in between. A common short word might be a single token, while a longer or less common word might split into two or three. Spaces, punctuation, and formatting count too. A commonly cited rule of thumb for English prose is that a token averages about three-quarters of a word, or roughly four characters, but the only way to know exactly is to let the model's own tokenizer count.
The important consequence is that token count does not map neatly to your intuition about length. Dense, technical, or non-English text can use more tokens per word than plain English prose. Code, with its punctuation and symbols, can be token-heavy. So "how long is my text" is the wrong question; "how many tokens is my text" is the one your bill cares about, and the two can diverge more than you expect.
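For a quick estimate before you have a tokenizer to hand, a character-based heuristic is often enough. The sketch below assumes roughly four characters per token, a common approximation for English prose and not any provider's real tokenizer; the function name and the default ratio are illustrative. Code or non-English text would likely need a smaller ratio, which is why it is a parameter.

```python
import math

# Assumption: ~4 characters per token, a rough heuristic for English prose.
# A real tokenizer is the only source of exact counts.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count (not a real tokenizer)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


prose = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(prose))

# Token-heavy content (code, some non-English text) packs fewer characters
# into each token, so pass a lower, equally hypothetical ratio.
print(estimate_tokens(prose, chars_per_token=2.5))
```

Treat the result as a planning number only, and replace it with real counts from the tokenizer or the API's usage report as soon as you can.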
Input tokens and output tokens are priced differently
Every interaction with a model has two token streams, and they almost always cost different amounts. Input tokens are everything you send in — the prompt, the instructions, any documents or examples you include. Output tokens are everything the model generates back. Providers price these two separately, and output tokens are typically the more expensive of the two.
The reason is rooted in how generation works. The model can process the whole input together in a single pass. Output, by contrast, is produced one token at a time, and each step is a fresh computation that depends on the input and on everything generated so far. That step-by-step production is the costly part, which is why generated output usually carries a higher price per token than the input it responded to. Knowing this changes how you optimize: a long output is often a bigger lever on your bill than a long input.
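The two-stream pricing can be sketched in a few lines. The prices here are hypothetical, chosen only so that output costs four times as much as input; the class and function names are illustrative and do not come from any provider's SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices per million tokens. The values used below are hypothetical."""
    input_per_million: float
    output_per_million: float


def call_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one call: each stream is multiplied by its own price."""
    input_cost = input_tokens * pricing.input_per_million / 1_000_000
    output_cost = output_tokens * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: output priced at 4x input.
example_pricing = Pricing(input_per_million=1.00, output_per_million=4.00)

# Under these prices, 2,000 input tokens cost the same as 500 output tokens.
print(call_cost(2_000, 0, example_pricing))
print(call_cost(0, 500, example_pricing))
```

With a price gap like this, shaving a few hundred tokens off a reply can matter as much as removing a much larger block of prompt text.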
The hidden tokens that drive your bill
The most common billing surprise comes from tokens you never explicitly typed. Three sources stand out.
First, system instructions and context you resend. In a conversation, the model has no memory between turns — so to keep continuity, applications resend the prior conversation and any standing instructions with every single request. The cost of that history is paid again on each turn. A long conversation gets more expensive per message as it grows, because each new message drags the whole transcript along as input.
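A small simulation makes the growth visible. It assumes the application resends the standing instructions and the full transcript on every turn, and that every user message and reply has a fixed size; all the numbers are made up for illustration.

```python
def input_tokens_per_turn(
    system_tokens: int, user_tokens: int, reply_tokens: int, turns: int
) -> list[int]:
    """Input tokens billed on each turn when the full history is resent."""
    history = 0
    per_turn = []
    for _ in range(turns):
        # Each request carries instructions + prior transcript + new message.
        per_turn.append(system_tokens + history + user_tokens)
        # The new message and the model's reply both join the transcript.
        history += user_tokens + reply_tokens
    return per_turn


# Hypothetical sizes: 500-token instructions, 50-token messages, 200-token replies.
per_turn = input_tokens_per_turn(500, 50, 200, turns=4)
print(per_turn)       # [550, 800, 1050, 1300]
print(sum(per_turn))  # 3700
```

The user typed only 200 tokens across the four turns, yet 3,700 input tokens were billed, and the gap keeps widening as the conversation goes on.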
Second, retrieved or attached content. When you give a model documents to work from, every one of those tokens is input you pay for. A feature that stuffs large documents into the prompt can quietly cost far more per call than the user's short question would suggest.
Third, the model's own intermediate work. Some models produce internal reasoning before their final answer, and that intermediate text is generated output you are typically billed for even when it is not shown to the user. A short visible answer can sit on top of a much larger volume of paid-for generation.
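To see how lopsided the third case can get, compare what the user sees with what is billed. This is a sketch with invented numbers; whether and how a given provider reports intermediate reasoning tokens varies, so the two inputs are assumed to come from whatever usage data you have.

```python
def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Output tokens paid for: the visible answer plus hidden intermediate work."""
    return visible_tokens + reasoning_tokens


def visible_share(visible_tokens: int, reasoning_tokens: int) -> float:
    """Fraction of billed output that the user actually sees."""
    billed = billed_output_tokens(visible_tokens, reasoning_tokens)
    return visible_tokens / billed if billed else 0.0


# Hypothetical call: a 150-token answer on top of 1,350 reasoning tokens.
print(billed_output_tokens(150, 1_350))  # 1500
print(f"{visible_share(150, 1_350):.0%}")
```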
Why the context window matters for cost
Every model has a maximum amount of text it can consider at once — its context window. It is tempting to treat a large context window as free room to dump everything in, but the window is a capacity, not a budget. You still pay for every token you place inside it. Filling a large window to the brim means paying for a large input on every call.
The window does impose a hard ceiling: input plus output cannot exceed it. But the practical discipline is to use far less than the maximum. The fewer tokens you send to accomplish the task, the less each call costs and, often, the faster it returns. A large window is a convenience for the occasional big job, not a license to be wasteful on the routine ones.
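The ceiling is easy to check before sending a request. The sketch below assumes the simple rule stated above, that input plus output must fit in the window; the window size and token counts are hypothetical, and some providers apply additional limits, such as a separate cap on output length.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the input plus the reserved output fits inside the window."""
    return input_tokens + max_output_tokens <= context_window


def max_output_budget(input_tokens: int, context_window: int) -> int:
    """Largest output that still fits after the input is placed in the window."""
    return max(0, context_window - input_tokens)


# Hypothetical 8,000-token window with a 6,500-token prompt.
print(fits_context(6_500, 2_000, context_window=8_000))  # False
print(max_output_budget(6_500, context_window=8_000))    # 1500
```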
Estimating and controlling spend
You can forecast costs before you ever ship. The arithmetic is straightforward: estimate the typical input tokens and output tokens per call, multiply each by its respective price, and multiply by how many calls you expect. Doing this on the back of an envelope before building catches expensive designs while they are still cheap to change.
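The back-of-the-envelope arithmetic fits in one function. The call volume, token counts, and prices below are placeholders to be replaced with your own estimates and your provider's published rates.

```python
def forecast_spend(
    calls: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Expected spend: per-call token cost multiplied by the number of calls."""
    per_call_units = (
        avg_input_tokens * input_price_per_million
        + avg_output_tokens * output_price_per_million
    )
    return calls * per_call_units / 1_000_000


# Hypothetical month: 100,000 calls, 3,000 input and 400 output tokens each,
# priced at 1.00 and 4.00 per million tokens.
monthly = forecast_spend(100_000, 3_000, 400, 1.00, 4.00)
print(f"{monthly:.2f}")  # 460.00
```

Running the same function with a trimmed prompt or a shorter reply shows immediately which change moves the total more.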
To control spend once you are running, a few habits do most of the work. Trim what you send — drop conversation history you do not need, summarize long context instead of resending it verbatim, and include only the documents that matter. Cap output length when the task allows, since output is the pricier stream. Reach for a smaller, cheaper model for routine work and reserve the expensive one for the calls that genuinely need it. And measure real usage rather than trusting estimates, because actual token counts on real traffic are the only numbers that pay the bill.
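Trimming history is the habit that is simplest to automate. One possible approach, sketched here, keeps only the most recent turns that fit within a token budget. It assumes each turn already has a token count attached; the function name and data shape are illustrative, and summarizing the older turns would be an alternative to dropping them.

```python
def trim_history(
    turns: list[tuple[str, int]], max_tokens: int
) -> list[tuple[str, int]]:
    """Keep the most recent turns whose combined token count fits the budget.

    Each turn is a (text, token_count) pair, ordered oldest to newest.
    """
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that
    # would push the total over budget.
    for text, tokens in reversed(turns):
        if used + tokens > max_tokens:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()  # restore oldest-to-newest order
    return kept


# Hypothetical transcript with precomputed token counts.
history = [("turn 1", 300), ("turn 2", 200), ("turn 3", 400), ("turn 4", 100)]
print(trim_history(history, max_tokens=600))
# [('turn 3', 400), ('turn 4', 100)]
```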
A quick worked intuition
Imagine a support assistant. A user types a one-line question: tiny input. But your system also sends a page of standing instructions, the last several turns of the conversation, and three retrieved help articles. The user's visible words are a rounding error; the real input is the instructions, history, and articles, repeated on every turn. If the assistant then writes a thorough multi-paragraph reply, that output is billed at the higher output rate, so it can account for a large share of the call's cost even when it is far shorter than the input. Seeing the call this way, with most of the cost in tokens the user never typed, is the whole insight. Optimizing the visible question would save nothing; trimming the invisible context and the output length is where the money is.
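Putting numbers on the support assistant makes the point concrete. Every token count and price below is invented for illustration; the breakdown shows each component's share of the cost of a single call.

```python
# All token counts and prices are made-up illustrative numbers.
INPUT_PRICE_PER_MILLION = 1.00
OUTPUT_PRICE_PER_MILLION = 4.00

input_parts = {
    "user question": 20,
    "standing instructions": 600,
    "conversation history": 1_500,
    "retrieved articles": 3 * 800,
}
reply_tokens = 400


def support_call_breakdown(parts: dict[str, int], reply: int) -> dict[str, float]:
    """Cost of each component of one call, in currency units."""
    costs = {
        name: tokens * INPUT_PRICE_PER_MILLION / 1_000_000
        for name, tokens in parts.items()
    }
    costs["reply (output)"] = reply * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return costs


costs = support_call_breakdown(input_parts, reply_tokens)
total = sum(costs.values())
for name, cost in costs.items():
    print(f"{name:<24}{cost / total:6.1%}")
```

Under these assumptions the user's question is well under one percent of the call's cost; everything else is context the application added or text the model generated.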
The takeaway
Tokens, not words or requests, are what you pay for — and the tokens that drive your bill are usually the ones you do not see: resent conversation history, attached documents, standing instructions, and the model's own intermediate generation. Input and output are priced separately, with output typically costing more. A large context window is capacity you pay to fill, not free space. Estimate with simple arithmetic before you build, trim what you send and cap what you generate, and measure real usage. Token pricing rewards the people who know exactly what they are sending and punishes the ones who do not.
