Handle errors and timeouts gracefully
Model calls fail, stall, and rate-limit. A practical guide to retries, timeouts, fallbacks, and fail-safe behavior that keeps an AI feature reliable.
A demo assumes everything works. A product assumes things will fail and stays useful anyway. When you call a language model over the network, you have inherited every failure mode of a remote service — timeouts, rate limits, transient errors, slow responses — plus a few specific to models, like outputs that do not match the shape you needed. The difference between a fragile feature and a reliable one is almost entirely in how it handles the unhappy paths. This guide walks through the failures you should plan for and how to handle each gracefully.
Know the failures you are planning for
Model calls fail in a few recognizable ways, and handling them well starts with telling them apart. Transient errors are temporary hiccups — a brief network blip or a server-side error that succeeds if you simply try again. Rate limits are the provider telling you to slow down because you are sending requests too fast. Timeouts happen when a response takes longer than you are willing to wait, which for models is common because generation time varies with output length. Invalid input errors mean the request itself was malformed and retrying unchanged will fail identically. And there are content-level failures: the call succeeds, but the output is wrong, empty, or not in the format you required.
These categories matter because the right response differs for each. Retrying a transient error is correct; retrying a malformed request is pointless. Backing off on a rate limit helps; hammering harder makes it worse. Before writing handling code, classify the error you are looking at, because the classification determines the cure.
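As a rough illustration of classifying before handling, the sketch below maps an HTTP-style status code to one of the categories above. It assumes your provider reports failures through status codes, with 429 for rate limits, 5xx for server-side errors, and other 4xx codes for malformed requests; FailureKind, classify_failure, and should_retry are hypothetical names, and the exact mapping should be checked against your provider's documentation. Content-level failures do not show up as a status code, so they are outside this sketch.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"
    RATE_LIMITED = "rate_limited"
    TIMEOUT = "timeout"
    INVALID_REQUEST = "invalid_request"
    UNKNOWN = "unknown"


def classify_failure(status: Optional[int], timed_out: bool = False) -> FailureKind:
    """Map a status code (or a timeout) to a failure category.

    The mapping is an assumption for illustration, not any provider's contract.
    """
    if timed_out:
        return FailureKind.TIMEOUT
    if status is None:
        # No response at all, e.g. a dropped connection: treat as a network blip.
        return FailureKind.TRANSIENT
    if status == 429:
        return FailureKind.RATE_LIMITED
    if 500 <= status < 600:
        return FailureKind.TRANSIENT
    if 400 <= status < 500:
        return FailureKind.INVALID_REQUEST
    return FailureKind.UNKNOWN


# Only these categories can be fixed by trying again.
RETRYABLE = {FailureKind.TRANSIENT, FailureKind.RATE_LIMITED, FailureKind.TIMEOUT}


def should_retry(kind: FailureKind) -> bool:
    return kind in RETRYABLE
```

Keeping the classification in one function means the retry, fallback, and logging code can all branch on the same category instead of each re-reading raw status codes.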
Retry the transient, back off politely
For transient errors and rate limits, retrying is the answer — but how you retry matters. Retrying immediately and repeatedly can make things worse, especially with rate limits, where a flood of instant retries just keeps you throttled. The standard, well-behaved approach is exponential backoff with jitter: wait a short time before the first retry, roughly double the wait on each subsequent attempt, and add a small random offset so many clients do not all retry in lockstep.
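import random
import time

# Placeholders: call_model, Transient, RateLimited, GiveUp. Defaults are illustrative.
def call_with_retries(request, max_attempts=5, base=0.5, max_delay=30.0):
    delay = base
    for attempt in range(max_attempts):
        try:
            return call_model(request)
        except (Transient, RateLimited) as error:
            if attempt == max_attempts - 1:
                raise GiveUp from error  # out of attempts: skip the pointless final wait
            time.sleep(delay + random.uniform(0, delay / 2))  # backoff plus jitter
            delay = min(delay * 2, max_delay)  # double the wait, up to the cap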
Cap the number of attempts and the maximum delay so a failing call does not retry forever. And only retry the errors that retrying can fix — wrap retries around transient and rate-limit failures, but let malformed-request errors fail fast, because retrying them wastes time and money on a request that cannot succeed.
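To make the effect of a cap concrete, here is a small sketch that computes the waits a capped exponential backoff would produce. The function names, the 25% jitter fraction, and the numbers in the usage note are illustrative assumptions, not recommended settings.

```python
import random
from typing import List


def backoff_schedule(base: float, cap: float, attempts: int) -> List[float]:
    """Un-jittered waits: base, 2*base, 4*base, ... never above cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def with_jitter(delay: float, rng: random.Random) -> float:
    """Add up to 25% extra wait so many clients do not retry in lockstep."""
    return delay + rng.uniform(0, delay * 0.25)
```

With a base of 0.5 seconds, a cap of 8 seconds, and six attempts, backoff_schedule(0.5, 8.0, 6) gives waits of 0.5, 1, 2, 4, 8, and 8 seconds: the doubling stops once the cap is reached, so the worst-case total wait is bounded and easy to reason about.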
Always set a timeout
A model call without a timeout is a trap. Generation time is variable, and occasionally a request stalls far longer than usual. Without a timeout, a single slow call can tie up a request handler, exhaust a connection pool, and cascade into an outage that looks like a total failure even though only a few calls misbehaved. Always set an explicit timeout on every model call, chosen to match how long the surrounding context can actually wait.
Pick the timeout deliberately. An interactive feature where a human waits needs a tighter limit than a background batch job that can afford patience. When a call exceeds the timeout, treat it like a transient failure: cancel it, and either retry or fall back. The point is that you decide how long to wait, rather than letting an unbounded call decide for you. Streaming helps here too — if you stream the response, time-to-first-token gives you an early signal that a call is alive, and you can apply a separate, tighter timeout to that first chunk.
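One way to enforce such a limit in async Python is asyncio.wait_for, which cancels the awaited call when the deadline passes. The sketch below assumes your model client exposes an awaitable call; ModelTimeout, call_with_timeout, and fake_model_call are hypothetical names, and the two timeout constants are placeholder values rather than recommendations.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Illustrative values only: tune them to what the surrounding context can wait for.
INTERACTIVE_TIMEOUT_S = 10.0
BATCH_TIMEOUT_S = 120.0


class ModelTimeout(Exception):
    """Raised when a model call exceeds its deadline."""


async def call_with_timeout(
    make_call: Callable[[], Awaitable[Any]], timeout_s: float
) -> Any:
    """Await make_call(), but give up and cancel it after timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the underlying call at this point.
        raise ModelTimeout(f"model call exceeded {timeout_s}s") from None


async def fake_model_call(delay_s: float) -> str:
    """Stand-in for a real client call; sleeps to simulate generation time."""
    await asyncio.sleep(delay_s)
    return "ok"
```

A caller would run something like asyncio.run(call_with_timeout(lambda: fake_model_call(0.01), INTERACTIVE_TIMEOUT_S)) and catch ModelTimeout to decide between a retry and a fallback. Synchronous clients usually accept a timeout argument of their own; check the client's documentation rather than assuming one.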
Have a fallback for when retries run out
Retries buy you resilience against temporary trouble, but sometimes the trouble is not temporary — a provider has a sustained outage, or every attempt times out. For these cases, decide in advance what your feature does when the model is simply unavailable. The wrong answer is an unhandled exception bubbling up as a broken screen.
Fallbacks come in several flavors depending on the task. You might fall back to a smaller model, or to one from a different provider, that may still be reachable when your primary is not. You might serve a cached or default response if one fits. You might degrade gracefully to a non-AI path: a simple rule, a manual option, or a "try again later" message that preserves the user's input so nothing is lost. The right fallback depends on the feature, but every AI feature should have one. The question to answer before launch is: when the model cannot respond at all, what does the user see? "An error page" is not an acceptable answer for anything important.
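The fallback options above can be arranged as an ordered chain. This is a minimal sketch under stated assumptions: each model is a plain callable that raises a hypothetical ModelUnavailable error once its own retries are exhausted, and the cache is an ordinary dict keyed by the user's input. A real feature would pick whichever of these steps make sense for it.

```python
from typing import Callable, Dict, Sequence


class ModelUnavailable(Exception):
    """Hypothetical error: a model could not answer, even after retries."""


def answer_with_fallbacks(
    user_input: str,
    models: Sequence[Callable[[str], str]],
    cache: Dict[str, str],
) -> Dict[str, str]:
    # 1. Try each model in order of preference (primary first, then alternatives).
    for model in models:
        try:
            return {"source": "model", "text": model(user_input)}
        except ModelUnavailable:
            continue
    # 2. Serve a cached response if one fits this input.
    if user_input in cache:
        return {"source": "cache", "text": cache[user_input]}
    # 3. Degrade gracefully and keep the user's input so nothing is lost.
    return {
        "source": "degraded",
        "text": "We could not generate a response right now. Please try again later.",
        "saved_input": user_input,
    }
```

Returning the source alongside the text lets the caller show a different message for degraded results and lets monitoring count how often each fallback fires.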
Validate the output, not just the call
A successful call is not a successful result. The model can return text that is empty, off-topic, or, most painfully, not in the structure your code expects. If you asked for JSON and parse the response assuming it is valid, a malformed output crashes your code just as surely as a network error. Treat the model's output as untrusted input and validate it before you rely on it.
After a call, check that the output meets your requirements: it parses, it has the required fields, it is within expected bounds. When validation fails, you have options. You can retry the call, sometimes with a clarifying instruction that points out what was wrong. You can attempt a tolerant repair for minor issues. Or you can fall back. What you should not do is pass unvalidated model output directly into code that assumes it is well-formed. The model is a probabilistic component; defensive validation is how you make a probabilistic component safe to build on.
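As a sketch of what this validation might look like, the code below checks a reply against a hypothetical schema, a JSON object with a "label" field and a "score" between 0 and 1, and retries once with a clarifying instruction before giving up. The schema, the function names, and the wording of the retry prompt are all assumptions made for illustration.

```python
import json
from typing import Callable, Optional

ALLOWED_LABELS = {"positive", "negative", "neutral"}


class InvalidOutput(Exception):
    """The call succeeded but the reply did not meet our requirements."""


def parse_sentiment(raw: str) -> dict:
    """Validate a reply shaped like {"label": ..., "score": ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidOutput("expected a JSON object")
    label, score = data.get("label"), data.get("score")
    if label not in ALLOWED_LABELS:
        raise InvalidOutput(f"unexpected label: {label!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0 <= score <= 1:
        raise InvalidOutput(f"score missing or out of bounds: {score!r}")
    return {"label": label, "score": float(score)}


def call_and_validate(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 2
) -> Optional[dict]:
    """Return a validated result, or None so the caller can fall back."""
    current = prompt
    for _ in range(max_attempts):
        try:
            return parse_sentiment(call_model(current))
        except InvalidOutput as exc:
            # Retry with a clarifying instruction that points out what was wrong.
            current = f"{prompt}\n\nYour previous reply was rejected ({exc}). Reply with JSON only."
    return None
```

Raising a dedicated exception for validation failures keeps them distinct from network errors, so they can be counted and handled separately.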
Make failures visible and observable
You cannot fix failures you never see. Log errors with enough context to understand them — the type of failure, the input that triggered it, how many retries it took, whether the fallback fired. Watching the rates of timeouts, rate limits, and validation failures tells you when something is degrading before users complain. A sudden spike in timeouts might mean a provider issue; a steady stream of validation failures might mean your prompt drifted or inputs changed.
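A minimal sketch of this kind of bookkeeping, using only the standard library, might look like the following. The logger name, the field names, and the in-memory counter are assumptions; a production system would more likely send these counts to a metrics service, and a request identifier stands in here for the input, since logging raw user input may not be appropriate for your data.

```python
import logging
from collections import Counter

logger = logging.getLogger("model_calls")
failure_counts: Counter = Counter()


def record_failure(
    kind: str, *, request_id: str, attempts: int, fallback_used: bool
) -> None:
    """Count the failure by type and log it with enough context to debug later."""
    failure_counts[kind] += 1
    logger.warning(
        "model call failed",
        extra={
            "kind": kind,
            "request_id": request_id,
            "attempts": attempts,
            "fallback_used": fallback_used,
        },
    )


def failure_rate(kind: str, total_calls: int) -> float:
    """Share of calls that failed in this way; 0.0 when there were no calls."""
    return failure_counts[kind] / total_calls if total_calls else 0.0
```

Tracking the rate per failure type, rather than one overall error rate, is what makes it possible to tell a spike in timeouts apart from a slow rise in validation failures.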
Surface errors to users honestly but gently. A clear, calm message that something went wrong and their input is safe beats a cryptic stack trace or a silent empty response. Internally, make sure failures are loud in your monitoring even when they are quiet for the user. The combination — graceful to the user, visible to you — is what lets you keep a feature reliable over time instead of discovering its weak points through complaints.
The takeaway
Reliability is built on the unhappy paths. Classify the failure you are facing — transient, rate-limited, timeout, malformed, or bad output — because each needs a different response. Retry the retryable with exponential backoff and jitter, fail fast on the rest, and put an explicit timeout on every call so one slow request cannot stall your system. Decide in advance what happens when the model is unavailable, and give every feature a real fallback. Validate the output as untrusted input before relying on it, and keep failures visible in your monitoring. Handle these well and your feature stays useful even on the days the model does not cooperate.
