Test your prompts like code
A prompt is code that ships to users. Treat it that way — with test cases, a baseline, and a regression check before every change.
A prompt is a piece of logic that runs in production and shapes what your users see. We would never ship a function that we had run once, on one input, eyeballed, and declared good. Yet that is exactly how most prompts are developed: tweak the wording, try it on the example in front of you, see something nice, ship it. This guide is about applying the discipline we already use for code (test cases, baselines, regression checks) to prompts, because prompts deserve it and break without it.
Why the demo is lying to you
The single most dangerous habit in prompt work is judging a prompt by one impressive output. You write a prompt, run it on a handpicked input, get a great answer, and feel done. But that one input doesn't represent the range of things real users will send. The next input — phrased differently, missing a field, longer, weirder — may produce garbage, and you won't know until a user finds out for you.
The demo is misleading for the same reason a single passing run misleads in code: it tests the happy path you had in mind, not the cases you didn't. Prompts are especially prone to this because the output is fluent and confident even when it's wrong, so a bad result reads as plausible. The only defense is to stop trusting individual outputs and start measuring behavior across a set of inputs that actually looks like reality.
Build a test set that looks like reality
Testing prompts starts with assembling a collection of representative inputs — the equivalent of test cases. Pull them from real usage if you have it, and make sure the set spans the variety you actually see: typical requests, but also the short ones, the malformed ones, the edge cases, and the "there is no good answer" cases. A test set of only easy inputs tells you your prompt handles easy inputs, which you already knew.
For each input, decide what a good output looks like. Sometimes there's a single correct answer you can check exactly. More often "good" is a set of properties: the right format, the relevant facts included, nothing fabricated, the failure case handled correctly. Writing down what good means before you run anything is what turns "that looks nice" into a real pass-or-fail judgment. The test set plus its success criteria is your specification for the prompt, made concrete.
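One way to make "the test set plus its success criteria" concrete is to store each input next to named checks on the output. The sketch below is only an illustration: the PromptCase class, the evaluate helper, and the two sample cases are hypothetical, and real criteria would usually be richer than substring tests.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PromptCase:
    """One test input plus the properties a good output must have."""
    name: str
    input_text: str
    # Each check takes the model output and returns True if the property holds.
    checks: dict[str, Callable[[str], bool]] = field(default_factory=dict)


def evaluate(case: PromptCase, output: str) -> dict[str, bool]:
    """Run every named check for this case against one output."""
    return {label: check(output) for label, check in case.checks.items()}


# Hypothetical cases: one typical request and one edge case.
TEST_SET = [
    PromptCase(
        name="typical_refund_request",
        input_text="I was charged twice for order 1182, please refund one charge.",
        checks={
            "mentions_order_id": lambda out: "1182" in out,
            "no_invented_amount": lambda out: "$" not in out,
        },
    ),
    PromptCase(
        name="empty_input",
        input_text="",
        checks={"asks_for_details": lambda out: "?" in out},
    ),
]
```

Naming each check means a failure reports which property broke, not just that the case failed.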
Decide how you'll grade the output
Code tests usually check exact equality. Prompt outputs rarely have one exact right answer, so you need a grading approach that fits. For structured outputs — a specific format, a required field, a value from a fixed set — you can check programmatically: does it parse, is the field present, is the value valid. These checks are cheap and worth automating because they catch the mechanical failures that matter most in an application.
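The three mechanical checks named above (does it parse, is the field present, is the value valid) can be sketched as follows. This assumes the prompt is supposed to return a JSON object with a "category" field; that field name and the allowed values are made up for illustration.

```python
import json

# Hypothetical fixed set of valid values for the "category" field.
ALLOWED_CATEGORIES = {"billing", "shipping", "other"}


def check_structured_output(raw: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["does not parse as JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    if "category" not in data:
        failures.append("missing field: category")
    elif not isinstance(data["category"], str) or data["category"] not in ALLOWED_CATEGORIES:
        failures.append(f"invalid category: {data['category']!r}")
    return failures
```

Returning reasons rather than a bare boolean makes it easier to see which kind of mechanical failure dominates across the set.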
For open-ended outputs, grading is judgment-based: does the answer cover the key points, avoid fabrication, hit the right tone. You can do this by reading, and for small sets reading is fine and underrated. At larger scale, a common approach is to have a model grade outputs against a rubric you write — useful, but only as good as the rubric, and worth spot-checking against your own reading. The point is to pick a grading method deliberately rather than falling back on "it looks fine," which is not a method.
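Spot-checking a model grader against your own reading can be as simple as comparing its pass/fail verdicts with yours on the cases you have read. The sketch below assumes both sets of verdicts are already available as dictionaries keyed by case name; the function names are hypothetical, and how the model grader produces its verdicts is left out.

```python
def agreement_rate(model_grades: dict[str, bool], human_grades: dict[str, bool]) -> float:
    """Fraction of cases graded by both where the verdicts match."""
    shared = set(model_grades) & set(human_grades)
    if not shared:
        raise ValueError("no cases were graded by both the model and a human")
    matches = sum(model_grades[name] == human_grades[name] for name in shared)
    return matches / len(shared)


def disagreements(model_grades: dict[str, bool], human_grades: dict[str, bool]) -> list[str]:
    """Names of cases where the two graders differ; these are the ones worth rereading."""
    shared = set(model_grades) & set(human_grades)
    return sorted(name for name in shared if model_grades[name] != human_grades[name])
```

A low agreement rate usually points at the rubric rather than the prompt, so the disagreeing cases are a good place to start tightening it.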
Establish a baseline before you change anything
Before you start improving a prompt, run your current prompt against the full test set and record how it does. That score is your baseline — the thing every change is measured against. Without a baseline, you're flying blind: you'll make a change, see one output improve, and have no idea whether the prompt got better or worse overall. A change that fixes one case while quietly breaking three others looks like progress and is the opposite.
The baseline is what makes prompt work cumulative instead of a random walk. Each candidate change gets run against the same set and compared to the baseline. If it does better across the set, it becomes the new baseline. If it does worse, you discard it, even if it produced one output you loved. This is the core loop, and it's the same loop that makes refactoring code safe: a known-good reference you can always check against.
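The loop described above can be sketched in a few lines. Everything here is an assumption made for illustration: the success criterion is reduced to "expected substring appears in the output", run_model stands for whatever function calls your model, and fake_model is a stand-in so the sketch runs offline.

```python
from typing import Callable

# (input text, substring a good output must contain); deliberately simple.
Case = tuple[str, str]


def pass_rate(prompt: str, cases: list[Case], run_model: Callable[[str, str], str]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    passed = sum(expected in run_model(prompt, text) for text, expected in cases)
    return passed / len(cases)


def choose_baseline(
    baseline: str,
    candidate: str,
    cases: list[Case],
    run_model: Callable[[str, str], str],
) -> tuple[str, float]:
    """Keep the candidate only if it scores strictly better across the whole set."""
    base_score = pass_rate(baseline, cases, run_model)
    cand_score = pass_rate(candidate, cases, run_model)
    if cand_score > base_score:
        return candidate, cand_score
    return baseline, base_score


def fake_model(prompt: str, text: str) -> str:
    """Stand-in for a real model call: uppercases the input if the prompt asks for it."""
    return text.upper() if "uppercase" in prompt else text
```

The strict comparison is a design choice: a candidate that merely ties the baseline is discarded, so the prompt only changes when there is measured evidence it should.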
Change one thing at a time
When a prompt underperforms, the temptation is to rewrite half of it at once — new instructions, new examples, new format, all together. If the result is better, you won't know which change helped; if it's worse, you won't know which change hurt. Either way you've learned nothing transferable. Change one variable, run the test set, compare to the baseline, and record the result. Then change the next.
This is slower per iteration and far faster overall, because you build actual knowledge about what your prompt responds to. You learn that this example fixed the format problem, that this instruction reduced fabrication, that this restructuring did nothing. That knowledge compounds across the project and across future prompts. Scattershot rewriting, by contrast, produces a prompt that works for reasons nobody understands and that nobody can safely modify later.
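Recording each single-variable change with its measured effect is what turns the iterations into reusable knowledge. A minimal sketch of such a log, with hypothetical names and made-up example entries, might look like this:

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One single-variable change and how many test cases passed before and after it."""
    change: str
    passed_before: int
    passed_after: int


def summarize(experiments: list[Experiment]) -> list[str]:
    """One line per change: what was changed and whether it helped, hurt, or did nothing."""
    lines = []
    for exp in experiments:
        delta = exp.passed_after - exp.passed_before
        verdict = "helped" if delta > 0 else "hurt" if delta < 0 else "no effect"
        lines.append(f"{exp.change}: {delta:+d} ({verdict})")
    return lines


# Illustrative entries only; the counts are invented.
log = [
    Experiment("add format example", 14, 18),
    Experiment("restructure sections", 18, 18),
    Experiment("shorten instructions", 18, 16),
]
```

Because each entry names exactly one change, the log only stays meaningful if each experiment really did vary one thing.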
Re-run the set before every change ships
The last piece is treating your test set as a regression suite. Models get updated, requirements shift, and a prompt that worked can quietly start failing — including on cases it used to handle. Every time you change the prompt, and ideally on a schedule even when you don't, run the full set again and confirm nothing regressed. A change that improves new cases while breaking old ones is a regression, and the only way to catch it is to keep running the old cases.
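Catching a regression means comparing results case by case, not only comparing totals. The sketch below assumes you keep a pass/fail result per case name for the old and new prompt; the function name and the sample data are hypothetical.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Names of cases that passed before the change and do not pass after it.

    A case missing from `after` is treated as failing, so dropped cases are flagged too.
    """
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


# Illustrative results: the total is 2 of 3 both times, yet one case regressed.
before = {"typical": True, "empty_input": True, "very_long": False}
after = {"typical": True, "empty_input": False, "very_long": True}
regressed = find_regressions(before, after)
```

In this made-up example the overall pass count is unchanged, which is why an aggregate score alone can hide a regression.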
This also protects you when the model underneath changes. Because your test set encodes the behavior you actually depend on, re-running it tells you immediately whether a new model version still satisfies your requirements or silently broke something. The test set becomes a durable asset: a definition of "working" that survives model changes, team changes, and the slow erosion of memory about why the prompt was written the way it was.
The takeaway
Treat a prompt like the production code it is. Build a test set that looks like real usage, edge cases included, and write down what a good output means for each input. Pick a grading method on purpose, establish a baseline before you change anything, and improve the prompt one variable at a time while measuring against that baseline across the whole set. Keep the set as a regression suite and re-run it before every change and after every model update. The difference between prompt folklore and prompt engineering is exactly this: one judges by a single demo, the other measures across a set — and only the second one keeps working when the inputs, the model, or the requirements move.
