Tokens and tokenization: why models see text strangely

Models don't read letters or words — they read tokens. Understanding that one fact explains spelling slips, odd costs, and why context limits work as they do.

models · 2026-05-14 16:37 KST · Lead Editor · 7 min read

When you type a sentence to a language model, you see words. The model does not. Before any "thinking" happens, your text is chopped into pieces called tokens, and it is those tokens — not letters, not words — that the model actually processes. This single translation step explains a surprising number of otherwise baffling behaviors: why a model might miscount the letters in a word, why some languages cost more to process than others, why your input can hit a length limit sooner than you expected, and why pasting a weird string sometimes produces weird output. Once you understand tokenization, a lot of model quirks stop being mysteries.

What a token actually is

A token is a chunk of text — usually a common word, a piece of a word, a space-plus-word, or a single character. It is the unit the model reads and writes. A short common word like "the" is often one token. A longer or rarer word might be split into several: something like "tokenization" could become "token" + "ization," and an unusual name might shatter into many small pieces.

The key insight is that tokens are not the same as words and not the same as characters. They sit in between. The model's entire view of language is built out of these chunks. When it generates a response, it is producing one token at a time, each chosen based on the tokens that came before. There is no point at which it works with letters or whole sentences as primary units.
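The one-token-at-a-time loop can be sketched in a few lines. This is a toy under loud assumptions: the lookup table, the `<start>` and `<end>` markers, and the function names are all invented for illustration, and the toy looks only at the most recent token, whereas a real model weighs the whole preceding sequence.

```python
# Toy stand-in for a model: maps the most recent token to the next one.
# Hypothetical data; a real model scores every token in its vocabulary
# using the full context, not a fixed lookup.
NEXT_TOKEN = {
    "<start>": "Models",
    "Models": " read",
    " read": " token",
    " token": "s",
    "s": ".",
    ".": "<end>",
}


def pick_next(context):
    """Choose the next token given everything generated so far."""
    return NEXT_TOKEN[context[-1]]


def generate(max_tokens=10):
    """Build a response one token at a time until <end> or the cap."""
    context = ["<start>"]
    while len(context) - 1 < max_tokens:
        next_token = pick_next(context)
        if next_token == "<end>":
            break
        context.append(next_token)
    return context[1:]  # drop the <start> marker


tokens = generate()
text = "".join(tokens)
```

Note that the output is assembled from chunks such as " token" and "s". At no point does the loop handle a letter or a whole sentence as its unit.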

Why split text up at all

It would seem simpler to feed the model whole words, or individual characters. Both extremes cause problems, and tokenization is the compromise.

If you used whole words, your vocabulary would be enormous and you would still constantly meet words you had never seen — typos, new slang, technical terms, names. The model would have no way to handle them.

If you used single characters, the vocabulary would be tiny and nothing would ever be unknown, but every piece of text would become a very long sequence, and the model would have to learn meaning from scratch out of raw letters. That is wasteful and slow.
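The two extremes are easy to see side by side. The sketch below assumes a tiny, hypothetical word-level vocabulary and an `<unk>` placeholder for anything outside it. Neither comes from a real tokenizer.

```python
# Hypothetical word-level vocabulary, kept tiny for illustration.
KNOWN_WORDS = {"the", "splits", "text"}


def word_level(text, vocab):
    """Whole-word units: short sequences, but unseen words become <unk>."""
    return [word if word in vocab else "<unk>" for word in text.split()]


def char_level(text):
    """Single-character units: nothing is unknown, but sequences get long."""
    return list(text)


sentence = "the tokenizer splits the text"
as_words = word_level(sentence, KNOWN_WORDS)
as_chars = char_level(sentence)
```

Here the word-level view is only five units long but loses "tokenizer" entirely, while the character-level view keeps everything at the price of 29 units for one short sentence.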

Tokenization splits the difference. Common words get their own tokens for efficiency. Rare words get broken into smaller reusable pieces, so the model can handle anything by assembling familiar fragments — even a word it has never seen, because it has seen the fragments. This is why models cope gracefully with novel words: they were built to reassemble meaning from sub-word parts.
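A minimal sketch of the idea, assuming a made-up vocabulary and a simple greedy longest-match rule. Real tokenizers learn their vocabularies from data and apply their own splitting rules, so treat this as an illustration of sub-word assembly, not as any particular model's algorithm.

```python
# Hypothetical sub-word vocabulary: a few whole words plus reusable fragments.
VOCAB = {"the", "token", "ization", "ize", "r", "s"}


def tokenize(text, vocab):
    """Greedy longest-match split with a single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece starts here: emit one character so that
            # no input is ever unrepresentable.
            tokens.append(text[i])
            i += 1
    return tokens


common = tokenize("the", VOCAB)
split = tokenize("tokenization", VOCAB)
novel = tokenize("tokenizers", VOCAB)
```

With this vocabulary, "the" stays whole, "tokenization" becomes "token" + "ization", and "tokenizers", which is not in the vocabulary at all, is still covered by familiar fragments.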

Why models "see text strangely"

Here is the crucial consequence: because the model operates on tokens, certain tasks that are trivial for a human become oddly hard for the model.

Consider counting the letters in a word, or reversing it, or noticing that two words rhyme. To you these are about individual letters. But the model may have received the whole word as a single token — an opaque chunk with no visible internal letters. Asking it to count the r's in a word is like asking someone to count the letters in a symbol they only recognize as a whole shape. The information is technically recoverable, but it cuts against how the model represents text. This is the real reason behind many "the AI can't spell" anecdotes. It is not stupidity; it is that letter-level structure is partly hidden by the very units the model reads.

The same effect explains why models can be shaky with precise character manipulation, with arithmetic on long numbers whose digits may be grouped into tokens in ways that do not line up with place value, and with tasks that depend on the exact internal composition of a string rather than its meaning.
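To make the contrast concrete, here is a small sketch. The token ID is invented and the single-token split is an assumption. How a given word actually tokenizes depends on the model. The string operations, by contrast, are ordinary Python and work on the letters directly.

```python
# Hypothetical: suppose this word arrives as one token with this made-up ID.
TOKEN_IDS = {"strawberry": 4821}


def model_view(word):
    """What the model receives under that assumption: an opaque ID."""
    return [TOKEN_IDS[word]]


def count_letter(word, letter):
    """Letter-level work done by a tool that sees the characters."""
    return word.count(letter)


def reverse(word):
    """Reverse a string character by character."""
    return word[::-1]


seen_by_model = model_view("strawberry")
r_count = count_letter("strawberry", "r")
backwards = reverse("strawberry")
```

Nothing in the list `[4821]` says how many r's the word contains. The plain string functions have no such blind spot, which is the case for handing letter-level tasks to a tool.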

Why the same text costs different amounts

Tokens are also the unit of measurement and billing. Model usage and pricing are typically counted in tokens, not words or characters. This has practical consequences worth internalizing.
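As a sketch of what token-based billing arithmetic can look like: the function below assumes separate per-million-token rates for input and output, which is a common arrangement but not something to take for granted. The function name and the prices in the example are hypothetical placeholders, not any provider's real rates.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost in currency units, given token counts and per-million rates.

    All rates are caller-supplied assumptions; check the actual price
    sheet for whichever model you use.
    """
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical rates: 2.0 per million input tokens, 8.0 per million output.
cost = estimate_cost(500_000, 250_000, 2.0, 8.0)
```

The point of the sketch is only that word counts and character counts never enter the calculation. Tokens are the sole input.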

Different languages tokenize with very different efficiency. Text in a language well represented in the tokenizer's training tends to pack into fewer tokens per idea, while other languages — or scripts that the tokenizer handles less efficiently — may need many more tokens to express the same content. The result is that the identical meaning can cost noticeably more to process in one language than another. The same goes for content like code, structured data, or text full of unusual symbols: it can fragment into more tokens than plain prose of the same visible length.
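One rough way to get a feel for this without any tokenizer is to compare UTF-8 byte counts. This is only a proxy, on the assumption that a tokenizer tends to fall back toward smaller, byte-like pieces for text it represents poorly. Bytes are not tokens, and a real tokenizer may compress either string far better or worse than this suggests.

```python
def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def bytes_per_char(text):
    """Average UTF-8 bytes per visible character (a crude density proxy)."""
    return utf8_len(text) / len(text)


english = "hello"
korean = "안녕하세요"  # a Korean greeting, also five characters long

english_bytes = utf8_len(english)
korean_bytes = utf8_len(korean)
```

Both greetings are five characters on screen, yet the Korean one occupies three times as many bytes. Visible length is a poor guide to how much raw material a tokenizer has to deal with.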

A rough rule of thumb often cited for typical English prose is that a token corresponds to a little less than a word on average — but treat any such ratio as a loose guide, not a constant. The only reliable way to know a token count is to measure it with the specific model's tokenizer, since each model family can tokenize differently.
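If a quick estimate is all you need, the rule of thumb can be written down directly. The ratio of 0.75 words per token used below is an assumed, commonly quoted figure for English prose. It is not a property of any specific tokenizer, and the function is a planning aid, not a measurement.

```python
import math

# Assumed ratio for typical English prose; a loose guide, not a constant.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text, words_per_token=WORDS_PER_TOKEN):
    """Rough token estimate from a whitespace word count, rounded up."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)


rough = estimate_tokens("one two three four five six")
```

For anything that matters, replace the estimate with a count from the tokenizer that belongs to the model you are actually using.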

Tokens and the context window

Every model has a context window: the maximum number of tokens it can handle in a single exchange, typically counting input and output together. That limit is measured in tokens, which is why the same window can feel larger or smaller depending on what you put in it.

This is also why long-document tasks need care. A document that looks moderate on screen might consume far more of the window than you guessed, especially if it is in a verbose-to-tokenize language or full of formatting and symbols. When you are designing anything that handles large inputs, thinking in tokens rather than pages or characters keeps you from being surprised by a truncated input or a request that silently exceeds the limit.
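A simple budget check captures the habit. The sketch assumes input and output share one window, as described above. The function names and the 8,000-token window in the example are hypothetical, and the token counts are assumed to come from a real tokenizer measurement.

```python
def fits_in_context(input_tokens, max_output_tokens, context_window):
    """True if the input plus the reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def remaining_for_output(input_tokens, context_window):
    """Tokens left for the response once the input is accounted for."""
    return max(0, context_window - input_tokens)


# Hypothetical window size, for illustration only.
WINDOW = 8_000

ok = fits_in_context(7_000, 500, WINDOW)
too_big = fits_in_context(7_800, 500, WINDOW)
left = remaining_for_output(7_800, WINDOW)
```

Running a check like this before sending a request turns a silent truncation into an explicit decision about what to trim or split.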

Practical implications

A few habits follow naturally once tokens are part of your mental model:

Don't ask models to do letter-level surgery casually. Counting characters, reversing strings, and similar tasks fight the token representation. If you need them done reliably, lean on a tool rather than the model's intuition.
Estimate length in tokens, not words, when you are near a context limit or watching costs — and measure rather than guess for anything important.
Expect cost and length to vary by language and content type, and budget accordingly rather than assuming parity across languages.
Don't over-trust visual length. A short-looking blob of code or symbols can be token-heavy; a long stretch of plain prose can be lighter than it appears.

The takeaway

Tokens are the hidden layer between your text and the model. Everything the model reads and writes is made of these chunks, usually words and word fragments, chosen as a compromise between unwieldy whole-word vocabularies and inefficient character-by-character processing. That compromise is what lets models handle any text gracefully, but it also hides letter-level detail, which is why precise spelling and character tasks trip them up. Because the model reads and writes tokens, they are also the natural unit for measuring length, cost, and context limits, and differences in how efficiently text tokenizes explain why the same idea can cost more in one language than another. You will rarely need to inspect tokens directly, but keeping them in mind turns a whole class of strange model behavior into something predictable.

#tokens #tokenization #context-window #text-processing
