Cost control 101: keeping an AI feature affordable

AI features bill by the token, and small habits compound into large invoices. Here are the durable levers for keeping cost in line without gutting quality.

Tutorials · 2026-04-25 14:40 KST · Lead Editor · 7 min read

An AI feature has an unusual property as software goes: it costs money every single time someone uses it. Traditional code runs on servers you have already paid for, so an extra call is nearly free. A language model bills you per request, and the bill scales with usage. That changes how you build. A prototype that costs pennies a day can quietly become a feature that costs more than the team building it, and the difference is rarely one big mistake — it is a hundred small habits compounding. This walkthrough covers the durable levers for keeping an AI feature affordable, the ones that keep working regardless of which model or vendor you use.

How you actually get billed

You cannot control a cost you do not understand, so start with the unit. Language models bill by the token — a chunk of text, very roughly a few characters or part of a word. Every request is charged for the tokens going in (your prompt: instructions, context, examples, the user's input) and the tokens coming out (the model's response). Both directions cost, and on many models the output tokens cost more per token than the input.

Two consequences fall out of this immediately. First, a long prompt is not free even before the model says a word: all that context is input you pay for on every call. Second, a verbose answer costs more than a terse one that says the same thing. Your total bill is essentially tokens per request, multiplied by the price per token, multiplied by the number of requests, and almost every lever below is a way to shrink one of those factors without shrinking quality.
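The arithmetic can be sketched in a few lines. This is a minimal illustration, not a vendor's billing formula: the `Pricing` class, the function names, and the dollar figures are all assumptions chosen for the example, and real rates vary by model and provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Illustrative prices in dollars per million tokens (assumed, not real rates).
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of one request: input and output tokens are both billed."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000


def monthly_cost(input_tokens: int, output_tokens: int, requests_per_day: int,
                 pricing: Pricing, days: int = 30) -> float:
    """Cost per request multiplied by request volume over a month."""
    return request_cost(input_tokens, output_tokens, pricing) * requests_per_day * days


# Hypothetical workload: 2,000 input tokens and 500 output tokens per call.
example = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
per_call = request_cost(2_000, 500, example)
per_month = monthly_cost(2_000, 500, 50_000, example)
```

Under these assumed numbers a call costs a little over a cent, yet 50,000 calls a day puts the month in the tens of thousands of dollars, which is why trimming either the tokens or the request count matters.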

Lever 1: Right-size the model

The single biggest cost decision is which model you use, because capability and price move together — the strongest models cost meaningfully more per token than smaller, faster ones. The instinct is to reach for the most capable model for everything. The discipline is to ask what each task actually needs.

Many tasks are easy: classifying a message, extracting a field, a short rewrite, a routine reply. A smaller, cheaper model often handles these perfectly well, and using a flagship model for them is paying for capability you are not using. Reserve the expensive models for the genuinely hard work — deep reasoning, nuanced generation — where the quality difference earns its price. A powerful pattern is to route by difficulty: a cheap model handles the easy majority, and only the hard cases escalate to the expensive one. The eval discipline from quality testing applies here directly — measure whether the cheaper model is good enough before assuming it isn't.
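Routing by difficulty might look like the sketch below. The model callables and the `looks_hard` heuristic are hypothetical stand-ins; in practice the difficulty check would come from your own eval data rather than a word count.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt, returns the response text.
ModelFn = Callable[[str], str]


def make_router(cheap: ModelFn, strong: ModelFn,
                is_hard: Callable[[str], bool]) -> ModelFn:
    """Build a function that sends hard prompts to the strong model, the rest to the cheap one."""
    def route(prompt: str) -> str:
        model = strong if is_hard(prompt) else cheap
        return model(prompt)
    return route


def looks_hard(prompt: str) -> bool:
    # Placeholder heuristic, assumed for illustration only:
    # very long prompts or explicit reasoning requests escalate.
    return len(prompt.split()) > 200 or "step by step" in prompt.lower()
```

The design choice worth noting is that the router defaults to the cheap model, so the expensive one has to be earned by the difficulty check rather than used by default.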

Lever 2: Spend fewer tokens per call

Once the model is right-sized, attack the tokens. On the input side, trim the prompt to what the task needs. Bloated instructions, redundant examples, and context the model does not use all cost money on every single call, and at scale that waste is the whole problem. This is doubly true in retrieval and agent systems, where it is tempting to stuff in "just in case" context: every unused passage is paid for again on every call.

On the output side, ask for what you need and no more. If you want three bullet points, say three bullet points; a model left unconstrained will often write three paragraphs. Where a task allows, request a compact format. None of this is about being stingy for its own sake — it is about not paying for tokens that add nothing. A prompt and a response carrying only what the task requires is both cheaper and usually clearer.
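On the input side, one way to keep "just in case" context out of a prompt is to give retrieved passages a token budget. The sketch below assumes each passage arrives with a relevance score from your retriever; `rough_token_count` is a crude stand-in (about four characters per token) and should be swapped for the vendor's tokenizer when real numbers matter.

```python
from typing import Callable, Iterable


def rough_token_count(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def fit_to_budget(passages: Iterable[tuple[float, str]], budget_tokens: int,
                  count_tokens: Callable[[str], int] = rough_token_count) -> list[str]:
    """Keep the highest-scoring passages whose combined size fits the budget."""
    kept: list[str] = []
    used = 0
    for _score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        size = count_tokens(text)
        if used + size > budget_tokens:
            continue  # too big for the remaining budget; a smaller passage may still fit
        kept.append(text)
        used += size
    return kept
```

The budget makes the cost of context explicit: adding a passage means displacing another or raising the number deliberately.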

Lever 3: Stop paying for the same work twice

A great deal of AI cost is paying repeatedly for identical or near-identical work, and there are two clean ways to stop.

Caching means reusing a result you already have. If many requests share a large, unchanging chunk of context — the same long system prompt, the same reference document — prompt caching lets you avoid reprocessing that fixed portion every time, which can sharply cut the input cost of repetitive calls. Separately, if users frequently ask the exact same question, you can cache the answer and serve it directly without calling the model at all. The cheapest model call is the one you never make.
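The second kind of caching, serving a repeated question without a model call, can be sketched as a small wrapper. The `AnswerCache` class and the `call_model` callable are hypothetical; the normalization (lowercasing and collapsing whitespace) is one assumed definition of "the same question". Vendor-side prompt caching is configured through the provider's API and is not shown here.

```python
import hashlib
from typing import Callable


class AnswerCache:
    """Serve repeated questions from memory instead of calling the model again."""

    def __init__(self, call_model: Callable[[str], str]):
        self._call_model = call_model  # hypothetical model callable
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Treat questions that differ only in case or spacing as identical.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, question: str) -> str:
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._call_model(question)
        self._store[key] = answer
        return answer
```

This version never expires entries, so it suits answers that do not change; anything time-sensitive would need an expiry policy on top.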

Deduplication and batching address volume. If your system would fire the same request twice, fire it once. If you have many requests that are not urgent — overnight processing, bulk analysis — handling them together is often cheaper than one frantic call at a time, and some platforms price non-urgent batch work below real-time rates. The common thread: identical work should be done once, and patient work should not pay the premium for impatience.
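Deduplication and batching combine naturally for non-urgent work. In this sketch `run_batch` is a hypothetical callable that takes a list of prompts and returns answers in the same order; how a real platform accepts and prices batch jobs is an assumption left outside the example.

```python
from typing import Callable, Sequence


def dedupe_and_batch(prompts: Sequence[str],
                     run_batch: Callable[[list[str]], list[str]],
                     batch_size: int = 100) -> list[str]:
    """Run each distinct prompt once, in batches, and return answers in the original order."""
    unique = list(dict.fromkeys(prompts))  # drops repeats, keeps first-seen order
    answers: dict[str, str] = {}
    for start in range(0, len(unique), batch_size):
        chunk = unique[start:start + batch_size]
        for prompt, answer in zip(chunk, run_batch(chunk)):
            answers[prompt] = answer
    # Fan the answers back out so callers see one result per original prompt.
    return [answers[p] for p in prompts]
```

Callers still get one answer per request, but the model only sees each distinct prompt once.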

Lever 4: Cap the runaways

Most cost disasters are not a steady drip; they are a single component that runs away. An agent stuck in a loop, calling the model again and again. A retry that fires on every failure without limit. A user — or a script pretending to be one — sending thousands of requests an hour. Steady usage is easy to budget. Runaways are what produce the invoice that makes someone's stomach drop.

So put ceilings everywhere a loop or a queue can form. Cap the number of steps an agent may take before it stops and reports. Cap retries. Limit how many requests a single user can make in a window. Set a hard maximum on response length so one pathological request cannot generate an enormous, expensive answer. None of these caps should bite during normal use; they exist precisely for the abnormal case, and the one time they save you they pay for themselves many times over.
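Three of these ceilings can be sketched without any vendor code. The function and class names, default limits, and the `step` and `fn` callables are all hypothetical. The cap on response length is not shown because it is typically a request parameter on the model API itself.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Optional


def run_agent(step: Callable[[int], Optional[str]], max_steps: int = 10) -> str:
    """Call `step` until it returns a final answer or the step cap is reached."""
    for i in range(max_steps):
        result = step(i)
        if result is not None:
            return result
    return f"stopped after {max_steps} steps without a final answer"


def call_with_retries(fn: Callable[[], str], max_attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry a failing call a bounded number of times, backing off exponentially."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of looping
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


class RateLimiter:
    """Allow each user at most `max_requests` within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._seen: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        recent = self._seen[user_id]
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # forget requests that have aged out of the window
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Each cap fails loudly rather than silently: the agent reports that it stopped, the retry re-raises the last error, and the limiter returns a refusal the caller must handle.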

Lever 5: Measure before you optimize

You cannot manage what you do not see, and most cost surprises are really visibility failures: the spend was there all along, but no one was watching. Before optimizing anything, instrument the feature so you know where the money goes: which operations are most expensive, how cost breaks down between input and output, which users or request types dominate.

This matters because cost, like performance, follows a skew — a small fraction of operations usually drives a large fraction of the bill. Optimizing a cheap, rare path feels productive and saves nothing. Find the expensive, frequent path first and fix that. Then watch the trend over time, because usage grows and a feature that was affordable at launch can drift expensive as adoption climbs. Set an alert for when spending crosses a threshold you chose deliberately, so you learn about a cost problem from your own dashboard rather than from a finance email.
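A minimal version of that instrumentation is a ledger that attributes spend to operations and flags a threshold crossing. The `CostLedger` class and the operation names are hypothetical; a production setup would persist the data and break it down further by user and by input versus output tokens.

```python
from collections import defaultdict


class CostLedger:
    """Accumulate spend per operation and flag when the total crosses a threshold."""

    def __init__(self, alert_threshold: float):
        self.alert_threshold = alert_threshold  # a limit you chose deliberately
        self.by_operation: dict[str, float] = defaultdict(float)
        self.total = 0.0

    def record(self, operation: str, cost: float) -> bool:
        """Record one call's cost; return True once total spend is at or over the threshold."""
        self.by_operation[operation] += cost
        self.total += cost
        return self.total >= self.alert_threshold

    def top(self, n: int = 3) -> list[tuple[str, float]]:
        """The n most expensive operations, largest first."""
        ranked = sorted(self.by_operation.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]
```

Calling `top()` before any optimization work is the quick check that the path you are about to fix is the one that actually dominates the bill.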

Balancing cost against quality

Every lever here trades against something, and pretending otherwise leads to a cheap feature nobody wants to use. A smaller model saves money and may answer slightly worse. A shorter prompt saves tokens and may drop a piece of context that mattered. Aggressive caching saves calls and risks serving a stale answer. The goal is not the lowest possible cost — that is a feature with no users. The goal is the lowest cost that still meets your quality bar. This is exactly where an eval earns its keep: it lets you make a cost-cutting change and check, with numbers rather than hope, that quality held. Cut cost, measure quality, keep the cuts that quality survives.
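The "cut cost, measure quality" loop can be sketched as a gate that compares a cheaper candidate against the current baseline on a fixed set of cases. Exact-match accuracy and the `max_drop` tolerance are assumptions made for brevity; a real eval would use whatever metric defines your quality bar.

```python
from typing import Callable, Sequence

# Hypothetical interface: a configuration is any callable from prompt to response text.
ModelFn = Callable[[str], str]


def accuracy(model: ModelFn, cases: Sequence[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases the model answers exactly."""
    correct = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return correct / len(cases)


def keep_cut(baseline: ModelFn, candidate: ModelFn,
             cases: Sequence[tuple[str, str]], max_drop: float = 0.02) -> bool:
    """Accept the cheaper candidate only if quality falls by no more than max_drop."""
    return accuracy(candidate, cases) >= accuracy(baseline, cases) - max_drop
```

The candidate can be a smaller model, a shorter prompt, or a cached path; the gate does not care which lever produced the saving, only whether quality held.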

The takeaway

Keeping an AI feature affordable comes down to a handful of durable habits: bill-aware design that understands tokens, right-sizing the model to each task, spending fewer tokens per call, never paying twice for the same work, capping the components that can run away, and measuring everything so you optimize the path that actually costs money. None of it requires a clever trick, and all of it survives changes in model and vendor because it follows from how these systems are priced. Pair each cut with an eval so quality stays honest, and a feature that could have bankrupted itself stays one you can afford to keep running.

#cost #tokens #caching #tutorial

Primary sources

Anthropic: pricing and token usage
OpenAI: production best practices