Tokenizers and why they matter for languages

A language model never sees words. It sees tokens. How text gets chopped into tokens quietly decides cost, speed, and fairness across languages.

research · 2026-05-05 08:17 KST · Lead Editor · 7 min read

A language model does not read words, and it does not read letters. Before any text reaches the model, it passes through a tokenizer that chops it into a sequence of units called tokens — and the model only ever sees those tokens. This step is so far upstream that it is easy to ignore, but it shapes almost everything downstream: how much a request costs, how long the model's memory effectively is, and, strikingly, how fairly the model treats different human languages.

The tokenizer is the model's senses. If you want to understand why the same idea can cost twice as much to express in one language as in another, you have to start here.

What a token actually is

A token is a chunk of text, and it is usually not a whole word. Modern tokenizers tend to break text into subword pieces. Common words might be a single token, while rarer words get split into several. A made-up or unusual string might be broken down nearly character by character. Spaces and punctuation are part of the token stream too: punctuation marks usually get their own tokens, and a leading space is often folded into the token for the word that follows. Capitalization can also change how something is split.

Why subwords rather than whole words? Because a vocabulary of whole words would be enormous and would still miss every new word, name, or typo. And why not just use individual characters? Because that makes sequences extremely long and forces the model to reassemble meaning from tiny fragments. Subword tokenization is the compromise: a fixed, manageable vocabulary that can still represent any input by combining pieces. It captures common patterns efficiently while never being completely stumped by something it has not seen.
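The compromise described above can be sketched with a toy segmenter. This is a minimal illustration, not how any particular tokenizer works: the vocabulary `TOY_VOCAB` and the function `segment` are hypothetical, and the greedy longest-match rule is just one simple way to combine pieces (production tokenizers typically apply learned merge rules instead).

```python
# Hypothetical toy vocabulary, chosen only for illustration.
TOY_VOCAB = {"the", "token", "izer", "un", "believ", "able"}


def segment(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; unmatched text falls back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: emit one character so any input stays representable.
            pieces.append(word[i])
            i += 1
    return pieces


print(segment("the"))           # ['the']
print(segment("tokenizer"))     # ['token', 'izer']
print(segment("unbelievable"))  # ['un', 'believ', 'able']
print(segment("zxq"))           # ['z', 'x', 'q']
```

The common word stays whole, the rarer words split into a few pieces, and the unseen string degrades to one piece per character, which is the behavior the paragraph describes.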

How the vocabulary gets built

A tokenizer's vocabulary is not handwritten — it is learned from a large corpus of text before the model is ever trained. The general principle behind the popular methods is the same: start small and merge what occurs together often.

A typical approach begins with the basic characters, then repeatedly looks for the most frequent adjacent pair and merges it into a new unit, adding that unit to the vocabulary. Do this many times and frequent sequences — common prefixes, suffixes, whole words that appear constantly — become single tokens, while rare sequences stay broken into smaller parts. The result is a vocabulary tuned to the statistics of the training text.
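The merge loop described above is commonly known as byte-pair encoding (BPE). The following is a minimal sketch of it under simplifying assumptions: the corpus is a tiny hypothetical word list, words are split into characters rather than bytes, and the function name `train_bpe` is made up for this example. Real implementations add pre-tokenization, tie-breaking rules, and far larger corpora.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to num_merges merges; return them in order, plus the merged corpus."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: "the" is frequent, "this" is rare.
toy_words = ["the"] * 5 + ["then"] * 2 + ["this"]
merges, corpus = train_bpe(toy_words, num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

After two merges the frequent word "the" is a single unit, while the rare word "this" is still three pieces ("th", "i", "s").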

That last phrase is the crux of the whole story. The tokenizer is tuned to the text it was built from. Whatever that text was rich in gets short, efficient tokens. Whatever it was poor in gets chopped into many small pieces.

Why this is not fair across languages

Most large tokenizers are trained on corpora dominated by a handful of widely written languages. Those languages — and especially English — end up with efficient tokenization: common words become single tokens, and a sentence turns into relatively few tokens.

Languages that were underrepresented in that corpus fare worse. The same meaning, expressed in such a language, can require noticeably more tokens, because the tokenizer never learned compact units for it and falls back to splitting words into many small fragments. Writing systems with large character sets, or scripts the tokenizer saw little of, can be hit especially hard, sometimes approaching one token per character, or more than one when a byte-level tokenizer falls back to raw bytes.
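One way to see the worst case is to count what is left when no learned units apply. The sketch below assumes a byte-level tokenizer that has learned no merges for a script, so each UTF-8 byte becomes its own token; the helper `fallback_units` and the sample greetings are illustrative only, and real tokenizers will usually do better than this floor.

```python
def fallback_units(text):
    """Counts a tokenizer degrades toward when it has no learned units for the text."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Short greetings in three scripts (illustrative samples).
samples = {"English": "hello", "Russian": "привет", "Korean": "안녕하세요"}

for language, text in samples.items():
    print(language, fallback_units(text))
# English {'characters': 5, 'utf8_bytes': 5}
# Russian {'characters': 6, 'utf8_bytes': 12}
# Korean {'characters': 5, 'utf8_bytes': 15}
```

Under that assumption the Korean greeting needs three times as many units as the English one despite having the same number of characters.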

This is not a small cosmetic difference. It has direct, compounding consequences:

Cost. Models are typically priced and metered per token. If your language needs more tokens to say the same thing, the same conversation simply costs more.
Effective memory. A model's context window is measured in tokens. More tokens per sentence means fewer sentences fit, so the model effectively remembers less of your document in a token-inefficient language.
Speed. More tokens to read and generate means more compute per request and slower responses.

So a speaker of an underrepresented language can pay more, get shorter effective memory, and wait longer — for identical content. The unfairness is baked in below the model, at the tokenizer.
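The arithmetic behind these consequences is simple enough to sketch. Every number below is a hypothetical placeholder (the context window, the price, and the tokens-per-sentence figures are not measurements of any real model or language); only the shape of the calculation matters.

```python
def sentences_that_fit(context_window_tokens, tokens_per_sentence):
    """How many whole sentences fit in the context window."""
    return context_window_tokens // tokens_per_sentence


def cost_of(num_sentences, tokens_per_sentence, price_per_million_tokens):
    """Cost of processing num_sentences at a per-token price."""
    return num_sentences * tokens_per_sentence * price_per_million_tokens / 1_000_000


# Hypothetical figures for illustration only.
CONTEXT_WINDOW = 8_000
PRICE_PER_MILLION = 2.0
TOKENS_PER_SENTENCE = {"language_a": 20, "language_b": 50}

for name, tps in TOKENS_PER_SENTENCE.items():
    fit = sentences_that_fit(CONTEXT_WINDOW, tps)
    cost = cost_of(1_000, tps, PRICE_PER_MILLION)
    print(name, fit, cost)
# language_a 400 0.04
# language_b 160 0.1
```

With these made-up figures, the less efficiently tokenized language fits 160 sentences instead of 400 in the same window and costs 2.5 times as much for the same thousand sentences.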

The downstream effects on quality

Tokenization can shape capability, not just cost. When words are shattered into many fragments, the model has to do more work to reassemble meaning, and patterns that would be obvious at the word level get spread thin across many tokens. Tasks that hinge on the exact structure of text — counting, spelling, manipulating characters, careful arithmetic — can stumble in surprising ways precisely because the model sees tokens, not the letters and digits a human sees.

This explains a class of behaviors that otherwise look baffling. When a model miscounts the letters in a word, it is not being dim; it never had the letters as clean separate units in the first place. The tokenizer handed it chunks, and the chunks hid the detail the task required.
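A small sketch makes the gap concrete. The split of "strawberry" into two pieces and the integer IDs below are hypothetical, invented for this example; an actual tokenizer may split the word differently.

```python
# Hypothetical piece-to-ID table; real vocabularies and IDs will differ.
TOY_IDS = {"straw": 1012, "berry": 2044}


def encode(pieces):
    """Map subword pieces to the integer IDs the model actually receives."""
    return [TOY_IDS[p] for p in pieces]


word = "strawberry"
pieces = ["straw", "berry"]  # assumed split, for illustration
ids = encode(pieces)

# A human, or ordinary string code, can count letters directly.
print(word.count("r"))  # 3

# The model's input is just a list of integers with no letters inside.
print(ids)  # [1012, 2044]
```

Nothing in `[1012, 2044]` says how many times "r" occurs; the model can only answer if it has separately learned what letters each token contains.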

What gets done about it

There is no perfect fix, but there are levers. Building the tokenizer's vocabulary from a more balanced, multilingual corpus gives underrepresented languages a fairer share of efficient tokens. Making the vocabulary larger leaves room for more languages to get compact units, at the cost of larger embedding and output layers, since every vocabulary entry needs its own row of parameters. Some systems are designed from the start to be multilingual and weight their tokenizer training accordingly.
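The cost of a larger vocabulary can be estimated with one multiplication: an embedding table holds one vector per vocabulary entry. The vocabulary sizes and hidden size below are hypothetical round numbers, and the sketch counts only the input embedding table (a separate output projection, where one exists, would add a similar amount).

```python
def embedding_parameters(vocab_size, hidden_size):
    """Parameters in an embedding table with one hidden_size vector per token."""
    return vocab_size * hidden_size


# Hypothetical sizes for illustration only.
HIDDEN_SIZE = 4096
small = embedding_parameters(32_000, HIDDEN_SIZE)
large = embedding_parameters(256_000, HIDDEN_SIZE)

print(f"{small:,}")  # 131,072,000
print(f"{large:,}")  # 1,048,576,000
```

Under these assumptions, growing the vocabulary eightfold grows the table from roughly 131 million to just over a billion parameters.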

None of these fully erase the gap, because any fixed vocabulary reflects priorities — you cannot give every script the most efficient possible encoding at once. But being deliberate about the tokenizer is one of the highest-leverage fairness decisions in the whole pipeline, precisely because it sits upstream of everything else.

How to think about it as a user

You rarely control the tokenizer, but you can reason about it. If you work in a language that tokenizes inefficiently, expect higher token counts, plan for it in budgets and context limits, and be aware that very character-sensitive tasks may be shakier. When comparing the cost of two models, remember that token counts for the same text can differ between them, because each ships its own tokenizer. The "price per token" headline means little without knowing how many tokens your actual text becomes.
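A quick comparison shows why the headline price can mislead. The model names, prices, and token counts below are all hypothetical; in practice you would obtain the token counts by running your own text through each model's tokenizer.

```python
# Hypothetical models: model_a is cheaper per token but tokenizes this text less efficiently.
MODELS = {
    "model_a": {"price_per_million": 1.0, "tokens_for_my_text": 3_000},
    "model_b": {"price_per_million": 1.5, "tokens_for_my_text": 1_500},
}


def effective_cost(model):
    """What the text actually costs: token count times per-token price."""
    return model["tokens_for_my_text"] * model["price_per_million"] / 1_000_000


for name, model in MODELS.items():
    print(name, effective_cost(model))
# model_a 0.003
# model_b 0.00225

cheapest = min(MODELS, key=lambda name: effective_cost(MODELS[name]))
print(cheapest)  # model_b
```

In this made-up case the model with the higher per-token price is the cheaper one for this text, because it needs half as many tokens.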

The takeaway

The tokenizer is the invisible layer that turns human text into the units a model actually consumes, and it is built from whatever corpus it was trained on. That single fact ripples outward: languages well represented in that corpus get cheap, compact, capable treatment, while underrepresented ones pay more, fit less in context, and run slower for the very same meaning. Tokenization is not a technicality to skip over — it is where a lot of a model's cost structure and a lot of its fairness are quietly decided, long before the model itself does any thinking.

#tokenization #languages #nlp #fairness
