Measuring quality: how to set up a basic eval

Vibes don't scale. A small, honest evaluation turns 'this feels better' into a number you can trust — here's how to build one from scratch.

tutorials · 2026-05-01 11:01 KST · Lead Editor · 7 min read

Most people building with language models measure quality the same way: they try a change, look at one or two outputs, and decide it "feels better." That works until it doesn't — until you have two prompts and no way to say which is actually better, or you ship a tweak that fixes one case and quietly breaks five others. The fix is an eval: a small, repeatable test that turns a vague sense of quality into a number you can compare. You do not need a framework or a platform to start. You need a handful of examples and the discipline to score them the same way twice. This walkthrough builds that from nothing.

Why "it feels better" fails

Judging a model by a single output has three problems, and each one is fatal on its own. It does not generalize — the impressive demo says nothing about the next hundred inputs. It is not repeatable — your impression depends on which example you happened to look at and what mood you were in. And it cannot detect regressions — when a change improves one case and worsens another, eyeballing one output at a time will never show you the trade. An eval solves all three by fixing the inputs, fixing the scoring, and looking at the whole set rather than a lucky sample.

The goal is not a perfect, scientific benchmark. It is something better than vibes: a measurement honest enough that when it says version B beats version A, you believe it.

Step 1: A dataset of real examples

An eval starts with a set of inputs that represent the work your system actually does. This is the most important step and the one people most want to skip. Twenty to fifty examples is plenty to begin; quality matters far more than quantity. The key property is that they are representative: drawn from real or realistic usage, covering the easy cases, the common cases, and especially the hard and weird ones.

Include the tricky inputs deliberately. The empty input, the ambiguous question, the request that should be refused, the case that broke last week. An eval built only from easy examples will report that everything is fine right up until a real user sends something hard. Your dataset is a small museum of the situations you care about getting right, and it should over-represent the situations you are afraid of.
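One way to hold such a dataset is a plain list of records, each tagged with a category so you can see at a glance whether the hard cases are actually in there. This is a minimal sketch: the field names (`id`, `category`, `input`, `expected`), the imagined support-ticket classifier, and the helper functions are all illustrative assumptions, not a required format.

```python
from collections import Counter

# Hypothetical examples for an imagined support-ticket classifier.
# Field names and contents are assumptions made for illustration.
DATASET = [
    {"id": "easy-01", "category": "easy",
     "input": "How do I reset my password?", "expected": "account"},
    {"id": "common-01", "category": "common",
     "input": "I was charged twice this month.", "expected": "billing"},
    {"id": "hard-01", "category": "ambiguous",
     "input": "It doesn't work.", "expected": "needs_clarification"},
    {"id": "hard-02", "category": "empty",
     "input": "", "expected": "needs_clarification"},
    {"id": "hard-03", "category": "refusal",
     "input": "Give me another customer's address.", "expected": "refuse"},
]


def category_counts(dataset):
    """Count examples per category to see whether hard cases are represented."""
    return Counter(example["category"] for example in dataset)


def missing_categories(dataset, required):
    """Return the required categories that have no example yet."""
    present = {example["category"] for example in dataset}
    return sorted(set(required) - present)
```

Calling `missing_categories(DATASET, ["easy", "empty", "regression"])` on this sample would flag that "the case that broke last week" has not been added yet.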

Step 2: Decide what "good" means

Before you can score anything, you have to say what a good output is for your task — and this is harder and more valuable than it sounds, because it forces you to make the standard explicit. Different tasks demand different definitions:

Exact-match tasks have one right answer: a classification label, an extracted field, a yes/no. Good means the output equals the expected answer.
Structured tasks care about form: valid JSON, the required fields present, the correct shape. Good means it parses and conforms.
Open-ended tasks — summaries, explanations, drafts — have no single right answer. Good is defined by criteria: is it accurate, does it stay on topic, does it avoid inventing facts, is it the right length and tone?

Write your definition down before you look at any results. Deciding what "good" means after seeing the outputs is how you talk yourself into the conclusion you wanted. The definition is your fixed yardstick, and a yardstick that bends to fit the answer measures nothing.

Step 3: Choose how to score

With a definition in hand, you need a method to apply it consistently to every example. There are three honest options, each with a different cost in time and in how far you can trust the result.

Programmatic checks are best when they apply. If there is a right answer or a required format, a few lines of code can compare the output to the expected result, or check that the JSON parses and has the right fields. This is fast, free, perfectly repeatable, and not subject to anyone's mood. Use it wherever the task allows.
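The two checks described above can be sketched in a few lines of Python. The function names are hypothetical, and the choice to ignore case and surrounding whitespace in the exact-match check is an assumption; a stricter task may want a byte-for-byte comparison.

```python
import json


def exact_match(output, expected):
    """Pass when output equals expected, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def valid_json_with_fields(output, required_fields):
    """Pass when output parses as a JSON object containing every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    # A bare list or number is valid JSON but not the object shape we want.
    return isinstance(parsed, dict) and all(
        field in parsed for field in required_fields
    )
```

Both functions return a plain `True` or `False`, so the same output always gets the same score no matter who runs the check or when.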

Human judgment is the fallback for open-ended work that resists automatic checking. You read each output and score it against your written criteria. To keep it honest, score on a simple, defined scale rather than a fuzzy impression — even "pass / borderline / fail" with a one-line rule for each beats a gut feeling. The trap is inconsistency: the same person scores the same output differently on a tired afternoon. Writing the criteria down is what holds the scoring steady across a session.
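If you record human scores in code rather than a spreadsheet, a small rubric table keeps the scale fixed. The sketch below is one possible shape: the wording of each rule, the numeric weights, and the function names are assumptions for illustration, and you would replace the rules with your own written criteria.

```python
# A hypothetical three-level rubric; the one-line rules are illustrative.
RUBRIC = {
    "pass": "Accurate, on topic, and invents no facts.",
    "borderline": "Accurate overall but off in length, tone, or focus.",
    "fail": "Contains an invented or wrong fact, or misses the request.",
}

# Assumed numeric weights, used only to compute an average.
POINTS = {"pass": 1.0, "borderline": 0.5, "fail": 0.0}


def record_score(scores, example_id, label):
    """Store a label for one example, rejecting anything outside the rubric."""
    if label not in RUBRIC:
        raise ValueError(
            f"unknown label {label!r}; expected one of {sorted(RUBRIC)}"
        )
    scores[example_id] = label


def average_score(scores):
    """Average the numeric weights of all recorded labels."""
    return sum(POINTS[label] for label in scores.values()) / len(scores)
```

Rejecting labels outside the rubric is a small guard against the scale drifting mid-session into "pass-ish" or "mostly fine".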

Model-as-judge uses a second language model to score the outputs against your criteria, which scales human-style judgment to many examples cheaply. It is genuinely useful and genuinely fallible — the judge has its own biases and can be fooled by confident, fluent, wrong answers. If you use it, validate it: have a human score a sample and check the judge agrees. A judge you have never checked is a number you should not trust.
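Checking the judge against a human can be as simple as comparing labels on the examples both have scored. This sketch assumes each set of labels is a dictionary from example id to label; it deliberately leaves out the call to the judge model, since that depends on whichever provider and prompt you use. The function names are hypothetical.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of shared examples where the judge matched the human label."""
    shared = sorted(set(human_labels) & set(judge_labels))
    if not shared:
        raise ValueError("no examples were scored by both human and judge")
    matches = sum(human_labels[i] == judge_labels[i] for i in shared)
    return matches / len(shared)


def disagreements(human_labels, judge_labels):
    """List (example_id, human_label, judge_label) for every mismatch."""
    shared = sorted(set(human_labels) & set(judge_labels))
    return [
        (i, human_labels[i], judge_labels[i])
        for i in shared
        if human_labels[i] != judge_labels[i]
    ]
```

The list of disagreements is usually more informative than the rate itself: reading those cases shows whether the judge is being fooled by fluent wrong answers or whether your written criteria are ambiguous. What agreement level counts as good enough is a judgment call this sketch does not make for you.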

Step 4: Run it and read the failures

Now you have the pieces: a dataset, a definition of good, a scoring method. Run every example through your system, score each one, and compute a simple summary — what fraction passed, the average score, however your method aggregates. That single number is your baseline. By itself it means little; its value appears the moment you change something and run again, because now "feels better" becomes "went from 71% to 78%," and that is a claim you can stand behind.
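Put together, the run is a short loop. The sketch below is a minimal version under stated assumptions: `toy_system` is a stand-in keyword rule where your real model call would go, the four-example dataset is invented, and scoring is a plain equality check.

```python
def run_eval(dataset, system, scorer):
    """Run every example through `system` and score it with `scorer`."""
    results = []
    for example in dataset:
        output = system(example["input"])
        results.append({
            "id": example["id"],
            "category": example["category"],
            "output": output,
            "passed": bool(scorer(output, example["expected"])),
        })
    return results


def pass_rate(results):
    """Fraction of examples that passed."""
    return sum(r["passed"] for r in results) / len(results)


def toy_system(text):
    """Stand-in for the real system: a keyword rule, purely for illustration."""
    return "billing" if "charge" in text.lower() else "account"


dataset = [
    {"id": "1", "category": "common",
     "input": "I was charged twice.", "expected": "billing"},
    {"id": "2", "category": "easy",
     "input": "Reset my password.", "expected": "account"},
    {"id": "3", "category": "empty",
     "input": "", "expected": "needs_clarification"},
    {"id": "4", "category": "common",
     "input": "Why this charge?", "expected": "billing"},
]

results = run_eval(dataset, toy_system, lambda out, exp: out == exp)
print(f"baseline pass rate: {pass_rate(results):.0%}")  # prints 75%
```

Keeping the raw output and category on each result, not just the pass flag, is what makes the failures readable afterwards.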

But the number is the smaller half of the payoff. The larger half is reading the failures. Pull every example that scored poorly and look for the pattern: a category of input the system mishandles, a recurring mistake, a kind of question it consistently fumbles. The aggregate score tells you whether things are working; the failures tell you what to fix. A baseline that hides its failures is a thermometer with no diagnosis. Always read the bottom of the distribution, not just the average.
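Reading the failures is easier when they are grouped. Assuming each result carries a `category` tag and a `passed` flag (both assumptions of this sketch, along with the sample data), a small helper can sort the failing groups so the biggest cluster comes first.

```python
from collections import defaultdict


def failures_by_category(results):
    """Group failed results by category, largest group first."""
    groups = defaultdict(list)
    for result in results:
        if not result["passed"]:
            groups[result["category"]].append(result)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


# Invented results, standing in for the output of a real run.
sample_results = [
    {"id": "1", "category": "easy", "passed": True},
    {"id": "2", "category": "empty", "passed": False},
    {"id": "3", "category": "ambiguous", "passed": False},
    {"id": "4", "category": "ambiguous", "passed": False},
    {"id": "5", "category": "common", "passed": True},
]

for category, failed in failures_by_category(sample_results):
    print(category, [r["id"] for r in failed])
```

The grouping only points you at where to look. The pattern itself still comes from reading the actual outputs in each group.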

Step 5: Change one thing at a time

An eval earns its keep when you use it to compare. Make a single change — a different prompt, a new instruction, another model — and rerun the same dataset with the same scoring. If the number goes up and no failure category got worse, keep the change. If it goes down, you just caught a regression before your users did, which is the whole point.

The discipline is changing one variable per run. Change three things at once and a better score tells you nothing about which one helped, or whether two helped while the third hurt. One change, one rerun, one comparison. It is slower than it feels like it should be, and it is the only way the number stays meaningful.
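A comparison between two runs can report more than the two pass rates. The sketch below assumes each run is a dictionary from example id to a pass flag, and lists which examples the change fixed and which it broke. This is a finer-grained view than the per-category check described above, and the function name and data are hypothetical.

```python
def compare_runs(baseline, candidate):
    """Compare two runs over the same dataset, keyed by example id.

    Each run maps example id -> True/False (passed).
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("runs cover different examples; rerun the same dataset")
    fixed = sorted(i for i in baseline if not baseline[i] and candidate[i])
    broken = sorted(i for i in baseline if baseline[i] and not candidate[i])
    return {
        "baseline_rate": sum(baseline.values()) / len(baseline),
        "candidate_rate": sum(candidate.values()) / len(candidate),
        "fixed": fixed,
        "broken": broken,
    }


# Invented runs: the change fixes two examples and breaks one.
before = {"a": True, "b": False, "c": True, "d": False}
after = {"a": True, "b": True, "c": False, "d": True}
report = compare_runs(before, after)
print(report)
```

Here the pass rate rises from 0.5 to 0.75, yet example `c` regressed. A rate on its own would have hidden that trade, which is why the `broken` list is worth checking even when the number goes up.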

The takeaway

A basic eval is four honest pieces: a small set of representative examples, a written definition of what good means, a consistent way to score, and the habit of changing one thing at a time and rerunning. You can build it with a spreadsheet and an afternoon. It will not be a perfect benchmark, and it does not need to be — it only needs to be better than vibes, repeatable enough to trust, and revealing enough to show you where the failures cluster. Once "it feels better" becomes a number you can defend, every later decision gets easier and more honest.

#evaluation #testing #quality #tutorial

Primary sources

Anthropic: define success criteria and develop tests
OpenAI: evaluation guide