Evaluation beyond benchmarks: human and model judges
Benchmarks measure what is easy to score. For open-ended work you need judgment — from people, or from a model standing in for them. Both can mislead.
For a long time, progress in machine learning was measured by benchmarks: fixed datasets with known right answers, where a model's score is simply how often it gets them. Benchmarks are wonderful when they apply. They are objective, repeatable, and comparable. The trouble is that the most interesting things models do now — write an essay, explain a concept, draft code, hold a helpful conversation — have no single right answer to check against. Evaluating that kind of work requires judgment, and judgment is messy.
This piece is about how the field copes: when benchmarks run out, you turn to judges. Sometimes those judges are people. Increasingly, they are other models. Both approaches are useful, and both can quietly lead you astray.
Why benchmarks stop being enough
A benchmark works when correctness is well-defined. Did the model label the image correctly? Did it solve the equation? You can score that automatically and trust the number.
Open-ended tasks break this. Suppose two models each write a summary of an article. Which is better? "Better" now depends on accuracy, completeness, clarity, tone, length, and whether it left out something important — a bundle of qualities no exact-match score captures. You could invent a proxy metric, like overlap with a reference summary, but that rewards surface similarity rather than genuine quality, and a great summary that happens to be worded differently scores poorly.
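To make the proxy-metric problem concrete, here is a minimal sketch of a word-overlap score, loosely in the spirit of ROUGE-1 but simplified to whitespace tokens. The function name `unigram_f1` and the three example sentences are invented for illustration; real overlap metrics add stemming, n-grams, and other refinements.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between a candidate and a reference summary.

    Simplified illustration: lowercased whitespace tokens only.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection: shared word count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example data.
reference = "the company reported record profits this quarter"
near_copy = "the company reported record losses this quarter"  # wrong fact, same wording
paraphrase = "earnings hit an all-time high over the last three months"  # right fact, new wording

near_copy_score = unigram_f1(near_copy, reference)
paraphrase_score = unigram_f1(paraphrase, reference)
print(f"near copy:  {near_copy_score:.2f}")
print(f"paraphrase: {paraphrase_score:.2f}")
```

In this toy case the factually wrong near-copy scores far higher than the accurate paraphrase, which is exactly the surface-similarity failure described above.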
There is also a subtler failure: benchmarks can be gamed and saturated. Once a benchmark becomes a target, systems get optimized for that specific test, and high scores stop reflecting general ability. A model can ace a benchmark and still be unpleasant or unreliable in real use. So the field reaches for evaluation methods that look more like how a human would actually judge the output.
Human evaluation: the gold standard, with caveats
The most direct way to judge open-ended quality is to ask people. Show humans the model's output and have them rate it, or show them two outputs and ask which they prefer. Preference comparisons are popular because "which of these is better?" is a far easier and more reliable question for a person than "score this from one to ten."
Human judgment is the closest thing we have to ground truth for subjective quality, and it underpins a great deal of how modern models are aligned to be helpful. But it is not a clean signal:
- It is slow and expensive. People are far costlier than an automated metric, which limits how much you can evaluate.
- It is inconsistent. Different people disagree; the same person disagrees with themselves on different days. You need many ratings to average out the noise.
- It is biased in predictable ways. Raters can favor longer answers, more confident-sounding answers, or better-formatted ones — even when those are not actually better. They can be swayed by fluent prose that is subtly wrong.
So human evaluation is the gold standard and a flawed instrument at the same time. The discipline is in designing the questions well, collecting enough ratings, and watching for the biases you know are lurking.
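How many ratings is "enough" can be estimated with a little arithmetic. The sketch below turns pairwise preferences into a win rate with a Wilson score interval, one common choice for a binomial proportion. It assumes ties have been dropped and that ratings are independent, which real rating pools only approximate; the function name and the counts are hypothetical.

```python
import math

def win_rate_with_interval(wins_a: int, total: int, z: float = 1.96):
    """Win rate of model A plus a Wilson score interval.

    z = 1.96 gives roughly a 95% interval. Assumes independent,
    tie-free comparisons (a simplification).
    """
    if total <= 0 or not 0 <= wins_a <= total:
        raise ValueError("need 0 <= wins_a <= total and total > 0")
    p = wins_a / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, center - half, center + half

# Same 60% observed win rate, very different certainty.
small = win_rate_with_interval(12, 20)
large = win_rate_with_interval(600, 1000)
for label, (p, low, high) in (("20 ratings", small), ("1000 ratings", large)):
    print(f"{label}: win rate {p:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With 20 comparisons the interval still straddles 0.5, so the preference could be noise; with 1000 it does not.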
The model as judge
Because human evaluation is so costly, a natural idea has taken hold: use a capable model to do the judging. Give a strong model the task, the candidate answer (or two answers to compare), and a rubric, and ask it to score or pick a winner. This is usually called LLM-as-judge.
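As a rough illustration of what "the task, the candidate answers, and a rubric" looks like in practice, here is a minimal sketch of a pairwise judge prompt and a parser for the reply. The prompt wording, the `Verdict:` line format, and both function names are assumptions made for this example, not a standard; the call to an actual model is deliberately left out.

```python
from typing import Optional, Sequence

def build_pairwise_prompt(task: str, answer_a: str, answer_b: str,
                          rubric: Sequence[str]) -> str:
    """Assemble a pairwise comparison prompt (hypothetical format)."""
    criteria = "\n".join(f"- {item}" for item in rubric)
    return (
        "You are evaluating two answers to the same task.\n\n"
        f"Task:\n{task}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        f"Judge only on these criteria:\n{criteria}\n\n"
        "End your reply with one line: 'Verdict: A', 'Verdict: B', or 'Verdict: tie'."
    )

def parse_verdict(reply: str) -> Optional[str]:
    """Return 'A', 'B', or 'tie' from the last verdict line, else None."""
    for line in reversed(reply.strip().splitlines()):
        text = line.strip().lower()
        if text.startswith("verdict:"):
            choice = text.split(":", 1)[1].strip()
            if choice in {"a", "b", "tie"}:
                return "tie" if choice == "tie" else choice.upper()
    return None  # unparseable replies should be counted, not silently dropped

prompt = build_pairwise_prompt(
    "Summarize the article in two sentences.",
    "First candidate summary.",
    "Second candidate summary.",
    ["factual accuracy", "covers the main point", "clarity"],
)
print(prompt)
```

Returning `None` for replies that do not follow the format is a deliberate choice: a judge that often fails to produce a verdict is itself a signal worth tracking.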
The appeal is obvious. A model judge is fast, cheap, available around the clock, and consistent in the narrow sense that it applies the same instructions to every item, although its verdicts can still vary from run to run when outputs are sampled. It can evaluate thousands of outputs in the time a human panel handles a handful, which makes it practical to test changes that would otherwise be too expensive to measure. For many open-ended tasks, a strong model's preferences line up reasonably well with what people prefer, well enough to be genuinely useful for rapid iteration.
This has become a workhorse of modern evaluation precisely because it unblocks the bottleneck. But it comes with its own catalog of hazards, and treating a model judge as an oracle is a recipe for fooling yourself.
How model judges mislead you
A model judge has biases, and because it is automated, those biases apply systematically to every single judgment. That can be worse than human noise, which at least tends to average out across many raters.
- Position and ordering effects. When comparing two answers, a judge can favor whichever was shown first (or last), regardless of content. Swapping the order and averaging is a standard precaution.
- Verbosity and style bias. Model judges often prefer longer, more elaborate, more confident-sounding answers, even when a short correct answer is better. Polished form can beat correct substance.
- Self-preference. A judge can favor outputs that resemble its own style or that it would have produced itself, which skews comparisons between models.
- Susceptibility to the question's framing. How the rubric is worded can swing the verdict, so the prompt to the judge is itself a design artifact you have to get right.
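The order-swapping precaution from the first item above can be sketched as follows. The `Judge` signature and the two stand-in judges are hypothetical, written only so the example runs without a model; a real judge would wrap a model call and return which slot it preferred.

```python
from typing import Callable

# Hypothetical interface: (task, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]

def debiased_compare(judge: Judge, task: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge in both orders and combine the two verdicts.

    Returns 'A' or 'B' when one answer comes out ahead across both
    orderings, and 'tie' when the verdicts cancel out.
    """
    forward = judge(task, answer_a, answer_b)   # A shown first
    backward = judge(task, answer_b, answer_a)  # B shown first
    a_votes = (forward == "first") + (backward == "second")
    b_votes = (forward == "second") + (backward == "first")
    if a_votes > b_votes:
        return "A"
    if b_votes > a_votes:
        return "B"
    return "tie"

def always_first(task: str, first: str, second: str) -> str:
    """Stand-in judge with pure position bias."""
    return "first"

def prefers_four(task: str, first: str, second: str) -> str:
    """Stand-in judge that looks at content: prefers the answer containing '4'."""
    if ("4" in first) == ("4" in second):
        return "tie"
    return "first" if "4" in first else "second"

biased_result = debiased_compare(always_first, "What is 2 + 2?", "5", "4")
content_result = debiased_compare(prefers_four, "What is 2 + 2?", "5", "4")
print(biased_result, content_result)
```

A judge driven purely by position collapses to a tie under this scheme, while a judge that responds to content gives the same winner in both orders. The cost is two judge calls per comparison.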
The deepest risk is circularity: if the judge and the model being judged share the same blind spots, the judge will happily rate confident nonsense as excellent, because it cannot see the error either. The evaluation looks rigorous and measures the wrong thing.
Making judges trustworthy
None of these problems mean you should abandon model judges; they mean you should treat their output as evidence, not verdict. Practices that help:
- Validate the judge against humans. Periodically check that the model judge's verdicts agree with careful human judgment on a sample. If they diverge, trust the humans and revise the judge's prompt or rubric until agreement recovers.
- Control for known biases. Randomize answer order, and check whether the judge's preferences simply track answer length.
- Use clear, concrete rubrics. A judge told exactly what to look for is more reliable than one asked an open-ended "which is better?"
- Keep humans in the loop for high stakes. Use cheap model judgment to iterate fast, and reserve human evaluation for the decisions that actually matter.
The goal is a layered system: automated judgment for speed and scale, anchored by periodic human judgment for ground truth.
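One way to put a number on "validate the judge against humans" is to compare verdicts on a shared sample and report both raw agreement and Cohen's kappa, which discounts the agreement expected by chance. This is a minimal sketch; the label lists are made-up data, and what counts as an acceptable kappa is a judgment call rather than a fixed threshold.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

def agreement_and_kappa(human: Sequence[str], judge: Sequence[str]) -> Tuple[float, float]:
    """Raw agreement rate and Cohen's kappa between two sets of verdicts."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return observed, math.nan  # kappa is undefined when only one label is ever used
    return observed, (observed - expected) / (1 - expected)

# Hypothetical verdicts on ten pairwise comparisons ("A" or "B" preferred).
human_labels = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
judge_labels = ["A", "A", "B", "A", "A", "B", "A", "A", "A", "A"]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement {observed:.2f}, kappa {kappa:.2f}")
```

Here the judge agrees with humans 80% of the time, but because it leans heavily toward "A", the chance-corrected figure is noticeably lower. Raw agreement alone can flatter a judge that simply favors the more common label.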
The takeaway
Benchmarks measure what is easy to score, and the most valuable things models do are not easy to score. That pushes evaluation toward judgment — from people, who are the gold standard but slow, inconsistent, and quietly biased, and from models acting as judges, which are fast and cheap but carry systematic biases of their own and risk a circular trap where a model rewards its own blind spots. Neither judge is an oracle. The reliable path is to use model judges for scale, validate them against humans, control for the biases you know exist, and keep human judgment anchoring the decisions that count. Good evaluation is not one number — it is knowing how much to trust the number you have.
