Choose the right model size for a task

Bigger is not always better. A practical method for picking a model size that matches the task, the budget, and the latency you can live with.

tutorials · 2026-05-09 15:05 KST · Lead Editor · 7 min read

Most teams reach for the largest model they can afford and call it a day. That works, in the sense that a capable model handles almost anything you throw at it. But it is rarely the right call. The largest model is the slowest and most expensive option, and many tasks do not need it. Choosing a model size is a matching problem: you want the smallest model that reliably does the job, not the biggest one you can justify.

What "size" actually buys you

Model providers usually offer a family — a small, fast, cheap tier and a large, slow, capable tier, often with something in between. The larger models are better at hard reasoning, subtle instructions, long-range consistency, and tasks where mistakes are subtle rather than obvious. The smaller models are dramatically faster and cheaper, and on straightforward tasks they are often just as accurate.

The key insight is that capability is not a single dial you always want turned to maximum. It is a resource you spend. A model that is "smarter than the task requires" delivers no extra value for the surplus; it just costs more and responds more slowly. The goal is to find where the task's difficulty meets the model's capability, and stop there.

Start by describing the task honestly

Before comparing models, describe what the task really demands. A few questions sharpen this quickly. Does the task require multi-step reasoning, or is it closer to a lookup or a transformation? How varied are the inputs — a narrow, predictable format, or messy open-ended text? How costly is a wrong answer — a minor annoyance, or a real failure? And how visible is latency — does a human wait on the response, or does it run in the background?

Classification, extraction, formatting, short rewrites, and routing are usually easy in this sense: predictable inputs, clear right answers, low reasoning depth. Open-ended analysis, multi-step planning, nuanced writing, and tasks where the model must hold many constraints at once are hard. Be honest here. The temptation is to call your task hard because it feels important. Importance and difficulty are different — an important task with simple mechanics still belongs on a small model.
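
One way to keep this description honest is to write the answers down as data before looking at any model. The sketch below is a hypothetical illustration: the `TaskProfile` fields and the `difficulty` rule are assumptions made for this example, not a standard scale. It counts only reasoning depth and input variety toward difficulty, so stakes and latency needs are recorded but do not inflate the label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Answers to the four questions above. Field names are illustrative."""
    multi_step_reasoning: bool  # chained reasoning rather than a lookup or transformation
    open_ended_inputs: bool     # messy free text rather than a narrow, predictable format
    costly_errors: bool         # a wrong answer is a real failure, not a minor annoyance
    human_waiting: bool         # a person is blocked on the response


def difficulty(profile: TaskProfile) -> str:
    """Rough difficulty label; an assumed heuristic, not a measurement."""
    # Only reasoning depth and input variety count toward difficulty.
    # costly_errors and human_waiting describe stakes and latency needs,
    # which matter later but do not make the task itself harder.
    signals = int(profile.multi_step_reasoning) + int(profile.open_ended_inputs)
    return ("easy", "mixed", "hard")[signals]


# An important task with simple mechanics: extracting totals from invoices.
invoice_extraction = TaskProfile(
    multi_step_reasoning=False,
    open_ended_inputs=False,
    costly_errors=True,
    human_waiting=False,
)
```

Here `difficulty(invoice_extraction)` returns "easy" even though errors are costly, which mirrors the point that importance and difficulty are separate questions.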

The cheap-first method

The reliable way to choose is to start small and move up only when forced. Begin with the smallest model in the family and run it against a set of real, varied inputs: not a single demo, but a handful that covers both the easy cases and the tricky ones. Read the outputs. If the small model is reliably good enough, you are done, and you have the fastest, cheapest option.

If it fails, look at how it fails. Sometimes the fix is a better prompt, not a bigger model — clearer instructions, an example or two, a named failure mode. Try that first. If the small model still falls short after a fair prompt, step up to the mid tier and repeat. Only escalate to the largest model when a real evaluation shows the smaller ones cannot do the job. This climb-from-the-bottom approach keeps you from overpaying by default, and it surfaces prompt problems that a large model would have quietly masked.
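
The climb can be sketched as a short loop. This is a minimal illustration under stated assumptions: each model is a plain function from prompt to text, the evaluation set is a list of (input, expected output) pairs, and exact-match scoring stands in for whatever grading your task really needs. The names `pass_rate` and `pick_smallest` are hypothetical, and the prompt-fixing step between tiers stays a manual job.

```python
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[str], str]   # prompt in, completion out
Case = Tuple[str, str]         # (input, expected output)


def pass_rate(model: Model, cases: Sequence[Case]) -> float:
    """Fraction of cases where the model's output matches exactly."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def pick_smallest(
    tiers: Sequence[Tuple[str, Model]],
    cases: Sequence[Case],
    bar: float,
) -> Optional[str]:
    """Return the first tier that clears the bar, or None if none does.

    `tiers` must be ordered smallest first, so the loop stops at the
    cheapest model that is good enough.
    """
    for name, model in tiers:
        if pass_rate(model, cases) >= bar:
            return name
    return None
```

A `None` result is a signal worth taking seriously: it usually means the prompt, the evaluation set, or the bar needs another look before anything ships.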

Match the model to the step, not the app

A common mistake is choosing one model for an entire application. Most real systems are pipelines with steps of different difficulty. Routing a request to the right handler is easy. Drafting a careful answer to a complex question is hard. Reformatting that answer into JSON is easy again. Using your largest model for every step means paying premium rates for the trivial ones.

A better design assigns a model size to each step. Cheap models handle classification, routing, extraction, and formatting. Expensive models handle the one or two steps that genuinely need deep reasoning. Two related patterns make this concrete. In a router, a cheap classifier decides up front which model each request goes to. In a cascade, a small model attempts the request first and either answers directly or hands the hard cases up to a larger one. Either way, the result is a system that spends its capability budget where it matters and saves everywhere else.
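
A per-step assignment and a simple cascade might look like the sketch below. Everything named here is an assumption for illustration: the step names, the tier labels, and especially the `UNSURE` sentinel, which presumes the small model has been prompted to emit that token when it cannot answer. Real systems often use a separate confidence check instead.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]  # prompt in, completion out

# Hypothetical per-step assignment for the pipeline described above.
STEP_TIERS: Dict[str, str] = {
    "route": "small",
    "draft_answer": "large",
    "format_json": "small",
}

UNSURE = "UNSURE"  # assumed sentinel the small model is prompted to emit


def cascade(prompt: str, small: Model, large: Model) -> Tuple[str, str]:
    """Try the small model first; escalate only when it declines to answer.

    Returns (tier_used, answer) so the escalation rate can be tracked.
    """
    answer = small(prompt)
    if answer.strip() != UNSURE:
        return "small", answer
    return "large", large(prompt)
```

Returning the tier alongside the answer makes it easy to count how often requests escalate, which shows whether the small model is really doing most of the work.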

Weigh cost and latency, not just accuracy

Accuracy is the headline, but it is not the only axis. Latency shapes user experience directly — a response that takes several seconds feels broken in an interactive chat but is fine in a nightly batch job. If a human is waiting, a faster small model that is slightly less polished can beat a slower large one that is marginally better.
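
Latency is easy to measure directly rather than guess at. The sketch below times a model callable over a list of prompts and reports the median and the 95th percentile, using a simple nearest-rank percentile. The function names are hypothetical, and timing a local function this way only approximates what users see once network and queueing delays are involved.

```python
import math
import time
from typing import Callable, Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency(model: Callable[[str], str], prompts: Sequence[str]) -> Dict[str, float]:
    """Time each call in seconds and summarize the distribution."""
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        model(prompt)
        samples.append(time.perf_counter() - start)
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}
```

For interactive use the 95th percentile is usually the more telling number, since the slow responses are the ones people notice.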

Cost compounds at scale. A price difference between tiers that looks trivial per request becomes the difference between a sustainable product and an unsustainable one once you multiply by millions of calls. When you evaluate, write down accuracy, latency, and cost together for each candidate. The right choice is often the smallest model that clears your accuracy bar, because it wins decisively on the other two. Reserve the large model for the cases where the accuracy gap is real and the stakes justify the premium.
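
Writing the three numbers down together can be as simple as one record per candidate. In the sketch below, every figure is a made-up placeholder, not a real price or benchmark, and the `Candidate` and `choose` names are hypothetical. The selection rule is the one described above: among candidates that clear the accuracy bar and the latency budget, take the cheapest.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float           # pass rate on your evaluation set, 0 to 1
    p95_latency_s: float      # 95th percentile latency in seconds
    cost_per_call_usd: float  # average cost of one request


def monthly_cost(candidate: Candidate, calls_per_month: int) -> float:
    """Project the per-call cost to a monthly bill."""
    return candidate.cost_per_call_usd * calls_per_month


def choose(
    candidates: Sequence[Candidate],
    min_accuracy: float,
    max_latency_s: float,
) -> Optional[Candidate]:
    """Cheapest candidate that clears both bars, or None if none does."""
    viable = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.p95_latency_s <= max_latency_s
    ]
    return min(viable, key=lambda c: c.cost_per_call_usd, default=None)


# Placeholder numbers for illustration only.
CANDIDATES = [
    Candidate("small", accuracy=0.94, p95_latency_s=0.4, cost_per_call_usd=0.0004),
    Candidate("mid", accuracy=0.96, p95_latency_s=1.1, cost_per_call_usd=0.003),
    Candidate("large", accuracy=0.97, p95_latency_s=2.8, cost_per_call_usd=0.015),
]
```

With these placeholder prices, 5 million calls a month come to roughly $2,000 on the small tier and $75,000 on the large one. A per-call gap of about a cent and a half is what separates the two bills.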

Re-check the decision over time

A model choice is not permanent. Providers release new models regularly, and a new generation's small tier often matches the previous generation's large tier. A task that needs your biggest model today may run comfortably on a cheaper one after the next release. Keep your evaluation set around so you can re-run it against new models when they ship. The same set that helped you choose initially lets you cheaply revisit the choice, and the trend usually favors moving down a tier, not up.

This is also why you should avoid hard-coding assumptions about a specific model into your prompts and code. Keep the model behind a small abstraction so swapping it is a configuration change, not a rewrite. The teams that benefit most from model progress are the ones who made switching easy.
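
A small abstraction of this kind could look like the sketch below. It is an illustration under assumptions, not a prescribed design: the model identifiers are placeholders, the JSON layout is invented for this example, and `send` stands in for whichever provider call your code actually makes.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModelConfig:
    model_id: str           # placeholder identifiers, not real model names
    max_output_tokens: int


def load_configs(raw_json: str) -> Dict[str, ModelConfig]:
    """Parse a step-to-model mapping from configuration text."""
    return {
        step: ModelConfig(**settings)
        for step, settings in json.loads(raw_json).items()
    }


class Completer:
    """Single entry point for model calls; callers name a step, never a model."""

    def __init__(
        self,
        configs: Dict[str, ModelConfig],
        send: Callable[[ModelConfig, str], str],
    ) -> None:
        self._configs = configs
        self._send = send  # the provider-specific call lives behind this function

    def complete(self, step: str, prompt: str) -> str:
        return self._send(self._configs[step], prompt)


# Hypothetical configuration; swapping a model means editing only this text.
EXAMPLE_CONFIG = """
{
  "route": {"model_id": "example-small-v1", "max_output_tokens": 16},
  "draft_answer": {"model_id": "example-large-v1", "max_output_tokens": 1024}
}
"""
```

Because application code only ever calls `complete("route", ...)` or `complete("draft_answer", ...)`, moving a step to a different tier touches the configuration and nothing else.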

The takeaway

Choosing a model size is matching, not maximizing. Describe the task's real difficulty, start with the smallest model, and climb only when a genuine evaluation forces you to. Assign sizes per step rather than per app, so cheap models do the easy work and expensive ones handle the few hard parts. Weigh latency and cost alongside accuracy — the smallest model that clears your bar usually wins overall. And revisit the choice as new models ship, because the smart default keeps getting cheaper.

#models #cost #latency #evaluation
