How models are evaluated: benchmarks, and why they lie

Benchmark scores look like measurements, but they are arguments. Here is how model evaluation actually works, and why a high number can still mislead you.

research · 2026-05-06 16:14 KST · Lead Editor · 7 min read

A benchmark score looks like a measurement. It has a number, a leaderboard, a winner. But a benchmark is closer to an argument than a measurement: it claims that performance on one carefully chosen task tells you something about ability in general. Sometimes that claim holds. Often it does not. Understanding how models are evaluated — and where the reasoning breaks — is what separates reading a leaderboard from being fooled by one.

This is not an argument that benchmarks are useless. They are essential; without shared tests, every claim about model quality would be marketing. The point is to read them the way a careful person reads any statistic: knowing what it measures, what it leaves out, and how it can quietly mislead.

What a benchmark really is

Strip away the leaderboard and a benchmark is three things: a fixed set of tasks, a way to run a model on them, and a rule for scoring the answers. That is it. The score summarizes how the model did on those specific tasks under that specific scoring rule.
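Those three parts fit in a few lines of code. The sketch below is illustrative only: the task list, the `toy_model` lookup table, and the function names are all invented for this example and do not come from any real benchmark.

```python
from typing import Callable

# Part 1: a fixed set of tasks, as (prompt, reference answer) pairs.
# These items are made up for illustration.
TASKS = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def exact_match(prediction: str, reference: str) -> bool:
    """Part 3: the scoring rule. Correct only if the normalized strings match."""
    return prediction.strip().lower() == reference.strip().lower()


def run_benchmark(model: Callable[[str], str], tasks=TASKS, scorer=exact_match) -> float:
    """Part 2: run the model on every task, then return the fraction scored correct."""
    results = [scorer(model(prompt), reference) for prompt, reference in tasks]
    return sum(results) / len(results)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model: a lookup table with one wrong answer."""
    answers = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return answers.get(prompt, "")


score = run_benchmark(toy_model)
print(f"score: {score:.2f}")
```

Everything the score can say is fixed by `TASKS` and `exact_match`. Change either one and the same model gets a different number.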

The leap — and it is a leap — is from "did well on these tasks" to "is good at this kind of thing." That generalization is only as strong as the benchmark is representative of the real work you care about. A coding benchmark made of self-contained puzzles may say little about maintaining a large messy codebase. A reading test of short clean passages may say little about long, contradictory documents. The number is real; the generalization is a hypothesis.

It is worth pausing on who builds benchmarks and why. Some are academic efforts to track progress on a research question. Some are built by the very teams whose models are being ranked. None is a neutral fact of nature: each one encodes a choice about what counts as good, what tasks deserve attention, and what gets ignored. When you read a score, you are also reading the values of whoever decided the test was worth making. That does not make benchmarks dishonest, but it does mean a benchmark measures what its authors thought mattered, which may not be what matters to you.

Why a single number hides more than it shows

Leaderboards compress a model into one figure so it can be ranked. Compression is the whole point and also the whole danger. Two models with the same headline score can differ enormously in where they succeed and fail — one steady across the board, the other brilliant on easy items and helpless on hard ones, averaging to the same place.
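A small sketch makes the compression concrete. The per-item results below are invented, and the "easy" and "hard" labels are assumed to come with the benchmark. Both hypothetical models land on the same overall accuracy while looking nothing alike underneath.

```python
from collections import defaultdict


def overall_accuracy(results):
    """results: list of (slice_name, is_correct) pairs."""
    return sum(int(ok) for _, ok in results) / len(results)


def accuracy_by_slice(results):
    """Break the same results down per slice, e.g. by difficulty."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, ok in results:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


# Invented data: 10 easy and 10 hard items per model.
steady = [("easy", i < 7) for i in range(10)] + [("hard", i < 7) for i in range(10)]
lopsided = [("easy", True) for _ in range(10)] + [("hard", i < 4) for i in range(10)]

for name, results in [("steady", steady), ("lopsided", lopsided)]:
    print(name, overall_accuracy(results), accuracy_by_slice(results))
```

Both models report 0.7 overall. Only the per-slice view shows that one of them gets fewer than half of the hard items right.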

A single number also erases the questions that usually matter most: How does it behave at the edges? Does it fail gracefully, or with confident nonsense? Is it consistent across rephrasings of the same task? None of that survives the collapse to one number. This is why holistic evaluation efforts such as Stanford CRFM's HELM argue for reporting many dimensions (accuracy, robustness, calibration, and more) rather than a single rank. A model is a surface, and a leaderboard photographs it from one angle.
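Consistency across rephrasings is one of the easier extra dimensions to measure. The sketch below is one possible way to do it, not a standard metric. The paraphrase groups and the `wobbly_model` lookup table are hypothetical.

```python
def consistency_rate(model, paraphrase_groups):
    """Fraction of groups where the model gives one answer to every rephrasing."""
    consistent = 0
    for prompts in paraphrase_groups:
        answers = {model(p).strip().lower() for p in prompts}
        consistent += int(len(answers) == 1)
    return consistent / len(paraphrase_groups)


# Each inner list holds rewordings of the same underlying question (invented).
GROUPS = [
    ["What is 2 + 2?", "Compute 2 + 2.", "2 plus 2 equals?"],
    ["Capital of France?", "Which city is France's capital?"],
]


def wobbly_model(prompt: str) -> str:
    """Hypothetical model whose answer changes with the wording."""
    table = {
        "What is 2 + 2?": "4",
        "Compute 2 + 2.": "4",
        "2 plus 2 equals?": "5",
        "Capital of France?": "Paris",
        "Which city is France's capital?": "paris",
    }
    return table[prompt]


rate = consistency_rate(wobbly_model, GROUPS)
print(f"consistent on {rate:.0%} of question groups")
```

This measures agreement, not correctness: a model that is wrong in the same way every time scores perfectly. That is why it belongs next to accuracy rather than in place of it.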

Contamination: when the test leaks into training

The most corrosive problem in model evaluation is contamination: the test questions, or close cousins of them, appearing in the model's training data. Models train on enormous swaths of the public internet, and popular benchmarks live on that same internet. When a model has effectively seen the answers, a high score measures memorization, not ability — like a student who got the exam in advance.

Contamination is hard to detect and hard to rule out, which is why a striking benchmark result deserves a specific question: could the model have seen this before? It also explains why fresh, held-out, or frequently rotated tests are valued — and why a model that dominates an old public benchmark but stumbles on a freshly written equivalent should make you suspicious rather than impressed.
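One common heuristic for the "could it have seen this before?" question is to look for long word sequences shared between a test item and training text. The sketch below assumes you can actually search a sample of the training corpus, which outsiders often cannot. The function names and example strings are invented. It catches only near-verbatim leaks, so a low overlap does not rule out paraphrased contamination.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs, n: int = 8) -> float:
    """Share of the test item's n-grams that also appear somewhere in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Invented corpus sample and test items; n=3 only because the strings are short.
corpus = ["Forum post: what is the boiling point of water at sea level? Answer: 100 C"]
leaked = "What is the boiling point of water at sea level?"
fresh = "How many moons orbit the planet Mars today?"

print(overlap_fraction(leaked, corpus, n=3))
print(overlap_fraction(fresh, corpus, n=3))
```

A high overlap is a reason for suspicion, not proof. The choice of `n` is a judgment call: shorter windows flag common phrases, longer ones miss lightly edited copies.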

Teaching to the test

Even without leaked answers, benchmarks distort what they measure. Once a benchmark becomes the scoreboard everyone watches, effort flows toward raising that score — sometimes by genuinely improving the model, sometimes by optimizing for the benchmark's quirks. The result is a model tuned to look good on the test while the underlying ability it was meant to track lags behind.

This is an old idea, usually called Goodhart's law: once a measure becomes a target, it stops being a good measure. AI is unusually exposed to it because benchmarks are public, competition is fierce, and the gap between "good at the test" and "good at the task" is easy to ignore when a number is going up. Rising scores can mean rising ability, or rising skill at the test. The leaderboard cannot tell you which.

You can see the effect over time. A benchmark that once genuinely challenged models becomes one where everyone scores near the top, not necessarily because the underlying problem was solved, but because the test became a known quantity that effort flowed toward. When a benchmark saturates, the interesting information is gone: it can no longer separate good from great, and the field moves on to a harder test. That cycle is healthy, but it is also a reminder that a maxed-out benchmark tells you almost nothing, and that yesterday's hard test is often one that today's models pass mostly for show.

What scoring leaves out

How answers are scored shapes what a benchmark can even see. Tasks with one clear right answer — a multiple-choice item, an exact match — are easy to grade and dominate benchmarks for that reason. But much real-world work has no single right answer: writing well, explaining clearly, being appropriately cautious, handling an ambiguous request. These resist automatic scoring, so they are under-measured, and under-measured qualities get under-optimized.

When the grader is itself a model, new distortions appear: it may favor a certain style, length, or confidence regardless of correctness. So before trusting a score, ask what the scoring rule could even detect. A benchmark is blind to everything its grader cannot see, and that blind spot is often exactly the part of the job that matters most.
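A tiny example shows how much the scoring rule decides. The two graders below are deliberately simple and the answers are invented. Neither is offered as a recommended rule; the point is that each is blind in a different direction.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict rule: the whole answer must equal the reference."""
    return prediction.strip().lower() == reference.strip().lower()


def contains_match(prediction: str, reference: str) -> bool:
    """Lenient rule: the reference must appear somewhere in the answer."""
    return reference.strip().lower() in prediction.strip().lower()


reference = "4"
answers = {
    "terse and right": "4",
    "verbose and right": "The answer is 4.",
    "wrong": "14",
}

# For each answer, record the (strict, lenient) verdicts.
verdicts = {
    label: (exact_match(text, reference), contains_match(text, reference))
    for label, text in answers.items()
}
for label, verdict in verdicts.items():
    print(label, verdict)
```

The strict rule rejects a correct answer for being a full sentence. The lenient rule accepts "14" because it contains "4". The same three answers score 1 of 3 under one rule and 3 of 3 under the other.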

How to read a leaderboard honestly

A few durable habits keep benchmarks useful instead of misleading:

Ask what the tasks are, not just what the score is. A number means nothing until you know what it summarizes.
Distrust tiny gaps. On a test of a few hundred items, a gap of a point or two near the top is usually within sampling noise, not a real ordering.
Prefer many dimensions to one rank. Robustness and failure behavior often matter more than peak accuracy.
Suspect contamination on any familiar public benchmark, especially when results look too clean.
Trust your own task most. The only evaluation that truly matters is on examples that resemble your actual work.

The last point is the most important and the most ignored. A public leaderboard is a starting filter, not a verdict. Your problem is the real benchmark.
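The "distrust tiny gaps" habit can be checked with a paired bootstrap, a standard resampling technique. The sketch below uses invented per-item results for two hypothetical models on the same 200 items. The function name, resample count, and seed are illustrative choices.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(B) - accuracy(A) on shared items."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-item difference: +1 if only B is right, -1 if only A is right, else 0.
    diffs = [int(b) - int(a) for a, b in zip(correct_a, correct_b)]
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample items with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Invented results: A gets 150 of 200 right, B gets 152, mostly on different items.
model_a = [i % 4 != 0 for i in range(200)]
model_b = [i % 4 != 1 or i in (1, 5) for i in range(200)]

observed = (sum(model_b) - sum(model_a)) / 200
lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"observed gap: {observed:+.3f}, 95% interval: [{lo:+.3f}, {hi:+.3f}]")
```

Here B leads by one point, but the interval spans zero, so this data cannot say which model is better. Running the same check on examples from your own work is a reasonable first version of "run your own."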

The takeaway

Benchmarks are arguments dressed as measurements. They are indispensable — but a score tells you how a model did on specific tasks under a specific scoring rule, and the leap to "good in general" is a hypothesis you have to check. Contamination, teaching to the test, and one-number compression all let a high score outrun real ability. Read benchmarks the way you read any statistic: ask what it measures, what it hides, and whether it reflects the work you actually need done. Then run your own.

#benchmarks #evaluation #leaderboards #measurement

Primary sources

Stanford CRFM: HELM (Holistic Evaluation of Language Models)
NIST: AI evaluation and measurement