Rate limits and retries: building resilient LLM calls

Hosted LLMs fail in ordinary ways — limits, timeouts, transient errors. A little retry discipline turns a fragile integration into a dependable one.

tools · 2026-04-10 08:22 KST · Lead Editor · 7 min read

The first version of almost every LLM integration works on the developer's laptop and falls over the first busy afternoon in production. The reason is rarely the model. It is that a hosted API is a shared network service, and shared network services fail in ordinary, predictable ways: they impose limits, they time out, they occasionally return an error that means nothing more than "try again in a moment." The difference between a fragile integration and a dependable one is not cleverness. It is a small amount of retry discipline applied consistently. This guide covers what goes wrong and how to handle each case calmly.

Why your calls fail (and which failures are normal)

Start by sorting failures into two piles, because they demand opposite responses.

Transient failures are temporary and not your fault in any lasting sense. The service was briefly overloaded, you hit a rate limit, a request timed out, a connection dropped. The defining feature is that the same request might succeed if you simply send it again. These are the failures retries exist for.

Permanent failures will not improve on repetition. A malformed request, an invalid key, a prompt that violates policy, an input too large for the model — sending it again just wastes time and quota and, worse, can dig you deeper into a rate limit. The defining feature is that the request is wrong, not unlucky.

The single most important habit in resilient LLM code is telling these two apart and responding differently. Retrying a permanent failure is a bug. Failing hard on a transient one is also a bug. Most fragile integrations make one of these two mistakes everywhere.
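As a minimal sketch of that sorting step, assuming the service reports failures through conventional HTTP status codes: the status set below and the names `FailureKind` and `classify_failure` are illustrative assumptions, not any provider's documented contract, so check the error reference for the API you actually use.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"   # the same request might succeed if sent again
    PERMANENT = "permanent"   # the request is wrong; fix it instead of retrying


# Assumed mapping based on conventional HTTP semantics; providers may differ.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def classify_failure(status_code=None, network_error=False):
    """Sort a failed call into the transient or the permanent pile."""
    if network_error:
        # Dropped connections and client-side timeouts carry no status code.
        return FailureKind.TRANSIENT
    if status_code in TRANSIENT_STATUSES:
        return FailureKind.TRANSIENT
    # Anything else (400, 401, 403, 413, 422, ...) is treated as permanent.
    return FailureKind.PERMANENT
```

Treating unknown codes as permanent is a deliberately conservative default: an unrecognized error surfaces immediately instead of being retried into a rate limit.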

Understanding rate limits

Rate limits are the most common transient failure, and they are not punishment. They are how a shared service protects itself and its other users from any single client overwhelming it. Providers typically cap usage along a couple of axes at once — how many requests you make in a window, and how much total work (often measured in tokens) you push through in that window. You can stay under one cap and still hit the other.

The practical consequence is that you cannot reason about your throughput by counting requests alone. A handful of very large requests can exhaust a token budget while you are nowhere near the request cap. When a rate limit fires, the service tells you so explicitly, often with a hint about how long to wait. The correct response is not to hammer harder. It is to slow down and come back.
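One common form for the wait hint is the standard HTTP `Retry-After` header, which carries either a number of seconds or an HTTP date. Whether a given provider sends it is an assumption to verify. The helper name `parse_retry_after` is hypothetical, and the sketch accepts fractional seconds even though the standard specifies whole seconds.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(headers, now=None):
    """Return the suggested wait in seconds, or None if there is no usable hint."""
    value = None
    for name, raw in headers.items():
        if name.lower() == "retry-after":   # header names are case-insensitive
            value = raw.strip()
            break
    if not value:
        return None
    try:
        return max(0.0, float(value))       # delay-seconds form, e.g. "2"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None                          # unparseable hint: ignore it
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```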

Two habits prevent most rate-limit pain before it starts. First, read the response headers and error bodies — providers expose your current usage and limits there, and that information is the input to backing off intelligently. Second, smooth your own traffic: if you have a burst of work, spread it out rather than firing it all at once, so you approach the limit gradually instead of slamming into it.
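Smoothing traffic on the client side might look like the sketch below, which tracks a request budget and a token budget over one sliding window and reports how long to wait before the next call fits under both. The class name `DualBudgetPacer`, the window length, and the limits are all assumptions for illustration. Real providers define their own windows and counting rules.

```python
import time


class DualBudgetPacer:
    """Track a request cap and a token cap over one sliding window."""

    def __init__(self, max_requests, max_tokens, window_seconds=60.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window_seconds
        self.clock = clock
        self.events = []  # (timestamp, tokens) for calls still inside the window

    def _prune(self, now):
        cutoff = now - self.window
        self.events = [(t, n) for (t, n) in self.events if t > cutoff]

    def wait_needed(self, tokens):
        """Seconds to wait before a call of this size fits under both caps."""
        if tokens > self.max_tokens:
            raise ValueError("request is larger than the whole token budget")
        now = self.clock()
        self._prune(now)
        used_requests = len(self.events)
        used_tokens = sum(n for _, n in self.events)
        wait = 0.0
        for t, n in self.events:  # oldest first
            fits = (used_requests < self.max_requests
                    and used_tokens + tokens <= self.max_tokens)
            if fits:
                break
            # Not enough room yet: wait until this event leaves the window.
            wait = t + self.window - now
            used_requests -= 1
            used_tokens -= n
        return max(0.0, wait)

    def record(self, tokens):
        """Note that a call of this size was just sent."""
        self.events.append((self.clock(), tokens))
```

A caller would sleep for `wait_needed(tokens)` before sending and call `record(tokens)` afterward. Because both budgets are checked, one large request can force a wait even when the request count is nowhere near its cap.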

Retrying the right way

When you do retry, how you retry matters enormously. The naive approach — retry immediately, over and over — is actively harmful. If the service is overloaded, a flood of instant retries makes it worse, and you become part of the problem you are trying to survive. The disciplined approach has three ingredients.

Exponential backoff. Wait a little before the first retry, then roughly double the wait before each subsequent one. The first hiccup gets a quick second chance; a persistent problem gets progressively more breathing room. This single pattern resolves the large majority of transient failures.

Jitter. Add a small random amount to each wait. Without it, many clients that failed at the same instant will all retry at the same instant, producing a synchronized stampede that re-overloads the service. Jitter spreads the retries out. It is a tiny change with an outsized effect at scale, and skipping it is a classic mistake.

A retry ceiling. Cap the number of attempts and the total time you will spend. Retrying forever turns a brief outage into a hung request that ties up resources and frustrates whoever is waiting. After the ceiling, give up cleanly and surface a real failure.

Put together: on a transient error, wait with exponential backoff plus jitter, retry up to a fixed limit, and if a hint about how long to wait was provided, honor it over your own computed delay.
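The three ingredients can be sketched in a few lines. Everything named here is hypothetical: `TransientError` stands in for whatever your client raises on retryable failures, and the base delay, cap, jitter range, and ceilings are illustrative defaults rather than recommended values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying."""

    def __init__(self, message="", retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # server's wait hint in seconds, if any


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff plus jitter for a 0-based attempt number."""
    exponential = min(cap, base * (2 ** attempt))
    # Add a random extra of up to 50% so clients do not retry in lockstep.
    return exponential * (1.0 + 0.5 * rng())


def call_with_retries(send, max_attempts=5, max_total_seconds=60.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call send() until it succeeds or a retry ceiling is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(max_attempts):
        try:
            return send()
        except TransientError as err:
            # Honor the server's hint over our own computed delay.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = backoff_delay(attempt)
            last_attempt = attempt == max_attempts - 1
            out_of_time = (clock() - start) + delay > max_total_seconds
            if last_attempt or out_of_time:
                raise  # give up cleanly and surface the real failure
            sleep(delay)
```

Only `TransientError` is caught, so any other exception, including one that represents a permanent failure, propagates on the first attempt without a retry.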

Timeouts and the limits of retrying

Every call needs a timeout, and choosing it is a genuine trade-off. Too short and you abandon requests that would have succeeded, turning slow-but-fine responses into failures. Too long and a stuck request hangs your system and ties up a user. Pick a timeout based on the response length you actually expect, and remember that long generations legitimately take longer — a timeout tuned for a one-line answer will wrongly kill a request for a long one.
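One rough way to size a timeout from expected response length is sketched below. The function name `timeout_for` and every number in it (generation speed, fixed overhead, safety factor) are placeholder assumptions. Measure your own latencies before adopting any of them.

```python
def timeout_for(expected_output_tokens, tokens_per_second=40.0,
                overhead_seconds=5.0, safety_factor=2.0):
    """Estimate a per-call timeout in seconds from the expected output size."""
    generation_seconds = expected_output_tokens / tokens_per_second
    # Fixed overhead covers connection setup and queueing; the safety factor
    # leaves room for slower-than-usual generation.
    return overhead_seconds + safety_factor * generation_seconds
```

With these assumed numbers, a 20-token answer gets a 6-second timeout and a 2,000-token answer gets 105 seconds, which is the gap a single fixed timeout cannot cover.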

Timeouts interact with retries in a way that bites people. A request can time out on your side while the server is still working on it. Retry blindly and you may run the same expensive work twice. For read-only generation that is merely wasteful. For any call that causes an effect — sending a message, writing a record, triggering a tool — duplicate execution is a real bug. The defense is idempotency: design effectful operations so that doing them twice is safe, often by attaching a unique key the server can use to recognize and de-duplicate a repeat.
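A toy illustration of the idempotency-key idea follows. `FakeMessageServer` and `deliver` are hypothetical stand-ins, not any real API: the point is only that the key is generated once per logical operation and reused by every retry, so the server can recognize a repeat and replay the stored result.

```python
import uuid


class FakeMessageServer:
    """Stand-in for a server that de-duplicates by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored result
        self.sent = []     # messages actually delivered

    def send_message(self, key, text):
        if key in self.results:
            return self.results[key]  # repeat: replay the result, do not re-send
        self.sent.append(text)
        result = {"id": len(self.sent), "text": text}
        self.results[key] = result
        return result


def deliver(server, text, transport, max_attempts=3):
    """Send one message, retrying timeouts with the same idempotency key."""
    key = str(uuid.uuid4())  # created once, outside the retry loop
    for attempt in range(max_attempts):
        try:
            return transport(server, key, text)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
```

If the key were generated inside the loop, each retry would look like a new operation and the message could be sent more than once.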

Failing gracefully when retries run out

Resilience is not only about recovering. It is also about failing well when recovery is impossible, because sometimes the service really is down and no amount of backoff helps. A graceful failure has a few properties.

It is bounded. The user or calling system gets a clear answer in reasonable time, not an indefinite hang.
It degrades rather than collapses. Where the product allows, fall back to something useful — a cached result, a simpler non-model path, an honest "this is temporarily unavailable" — instead of a blank error.
It is visible. Log the failure with enough context to understand it later, and surface a signal you can monitor, so a rising failure rate reaches you before your users escalate it.
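These three properties can be sketched as a small wrapper. It assumes `call_model` is a callable that already enforces its own timeout and retry ceiling (which is what keeps the failure bounded), and the names `answer_with_fallback`, `cache`, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("llm_client")

UNAVAILABLE_MESSAGE = "This feature is temporarily unavailable. Please try again shortly."


def answer_with_fallback(prompt, call_model, cache):
    """Return (text, source), where source is 'model', 'cache', or 'unavailable'."""
    try:
        text = call_model(prompt)  # assumed to be bounded by its own timeout
    except Exception as err:
        # Visible: log enough context to understand the failure later.
        logger.warning("model call failed: kind=%s prompt_chars=%d",
                       type(err).__name__, len(prompt))
        # Degrade: prefer a cached answer, else an honest message.
        if prompt in cache:
            return cache[prompt], "cache"
        return UNAVAILABLE_MESSAGE, "unavailable"
    cache[prompt] = text
    return text, "model"
```

Returning the source alongside the text lets the caller, and your monitoring, distinguish a fresh answer from a degraded one.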

The mark of a mature integration is not that it never fails. It is that when it fails, nothing downstream is surprised.

See it before it hurts

You cannot tune what you cannot see. Track, at minimum, your failure rate by type, how often retries fire and how often they ultimately succeed, your latency including the time spent in backoff, and how close you are running to your rate limits. That last one is the early-warning system: usage creeping toward the ceiling is your cue to spread traffic, optimize prompts, or request higher limits before you hit the wall, not after. Many rate-limit incidents show up as a trend well in advance, but only to someone who is looking.
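A minimal in-process sketch of those measurements follows. The names `CallMetrics` and `limit_headroom` are hypothetical, and a production system would more likely export these numbers to an existing metrics backend than keep them in memory.

```python
from collections import Counter


class CallMetrics:
    """Minimal in-memory counters for the signals worth watching."""

    def __init__(self):
        self.calls = 0
        self.calls_with_retries = 0
        self.recovered_after_retry = 0
        self.backoff_seconds = 0.0
        self.failures_by_type = Counter()

    def record_call(self, retries, succeeded, backoff_seconds=0.0, failure_type=None):
        """Record one finished logical call, however many attempts it took."""
        self.calls += 1
        self.backoff_seconds += backoff_seconds
        if retries > 0:
            self.calls_with_retries += 1
            if succeeded:
                self.recovered_after_retry += 1
        if not succeeded:
            self.failures_by_type[failure_type or "unknown"] += 1

    def retry_success_rate(self):
        """Share of retried calls that eventually succeeded, or None if none retried."""
        if self.calls_with_retries == 0:
            return None
        return self.recovered_after_retry / self.calls_with_retries


def limit_headroom(used, limit):
    """Fraction of a rate limit still unused in the current window."""
    return (limit - used) / limit
```

A falling `limit_headroom` value over successive windows is the trend to alert on, well before it reaches zero.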

The takeaway

Resilient LLM calls come down to a short, boring discipline. Separate transient failures from permanent ones and respond to each correctly. Retry transient failures with exponential backoff, jitter, and a firm ceiling, never in a tight immediate loop. Respect rate limits by smoothing your traffic and reading what the service tells you. Set timeouts deliberately, and make effectful calls idempotent so a retried request never applies its effect twice. Fail in a bounded, visible, degrading way when retries are exhausted, and watch your limits so you act before you hit the wall. None of it is glamorous, and all of it is what separates an integration that survives a busy afternoon from one that does not.

#rate-limits #retries #reliability #llm-api

Primary sources

OpenAI API documentation
Anthropic documentation