Observability for LLM apps: logging what matters
When an LLM app misbehaves, "it gave a bad answer" is not a debuggable fact. Here is what to log so you can actually find out why.
A traditional application either works or throws an error, and when it breaks you read the stack trace. An LLM application has a third state that is far more common and far harder to debug: it runs fine, returns no error, and produces an answer that is subtly or badly wrong. There is no exception to catch. Observability for LLM apps is the practice of capturing enough about each interaction that "it gave a bad answer" becomes a question you can actually investigate. This explainer covers what to log, why, and how to do it without drowning in data.
Why ordinary monitoring is not enough
Classic observability watches for errors, latency, and resource use. Those still matter, but they miss the failure mode unique to language models: a perfectly successful call that returns the wrong content. The HTTP status is fine, the latency is normal, nothing is on fire — and the output is hallucinated, off-instruction, or unsafe.
To debug that, you need to see inside the interaction. What exact prompt went to the model, including the parts assembled at runtime? What did the model return? What context was retrieved and injected? Which version of the prompt was live? None of this is visible from infrastructure metrics. LLM observability exists to make the content of each interaction inspectable after the fact, because at the moment something goes wrong, the only durable evidence is what you logged.
The core record: capture the full interaction
The foundation is a record of each model interaction complete enough to reconstruct what happened. At minimum, that means:
- The full resolved prompt. Not the template — the actual text sent, with all variables, retrieved context, and conversation history filled in. The bug is very often in what got assembled, not in the template you wrote.
- The full response. Exactly what the model returned, before your post-processing reshapes or truncates it.
- The model and parameters. Which model handled the call, and which settings shaped its behavior, such as temperature. The same prompt can behave differently when either one changes.
- The prompt version. Which version of your prompt was live, so a regression can be tied to a specific change.
With these four, most "why did it do that" investigations become tractable. Without them, you are guessing. The discipline is to log the resolved state, because the gap between the template you intended and the prompt you actually sent is where a surprising share of bugs live.
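As a minimal sketch of that core record, the Python below renders a template, then logs the resolved prompt alongside the response, model, parameters, and prompt version as one JSON object per line. The field names, the `support-v3` version label, the `example-model` identifier, and the hard-coded response are all illustrative assumptions, not any particular library's schema.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class InteractionRecord:
    """One model interaction, captured as it was actually sent and received."""
    resolved_prompt: str   # final text sent, after templating and context injection
    raw_response: str      # model output before any post-processing
    model: str             # model identifier used for the call
    params: dict           # sampling settings, e.g. {"temperature": 0.2}
    prompt_version: str    # hypothetical version label for the template
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def render_prompt(template: str, variables: dict) -> str:
    """Fill the template; the result is what gets logged, not the template."""
    return template.format(**variables)


def to_log_line(record: InteractionRecord) -> str:
    """Serialize the record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


template = "Answer using only this context:\n{context}\n\nQuestion: {question}"
prompt = render_prompt(
    template,
    {"context": "Refunds take 5 days.", "question": "How long do refunds take?"},
)
record = InteractionRecord(
    resolved_prompt=prompt,
    raw_response="Refunds take 5 days.",  # stand-in for a real model response
    model="example-model",
    params={"temperature": 0.2},
    prompt_version="support-v3",
)
line = to_log_line(record)
print(line)
```

Writing one self-describing line per interaction keeps the log easy to search and replay later; where the lines end up (a file, a queue, a database) is a separate choice.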
Trace the whole chain, not just the model call
Modern LLM apps are rarely a single call. A request might retrieve documents, call the model, invoke a tool, then call the model again. When the final answer is wrong, the cause could be at any step — bad retrieval, a malformed tool result, or the model itself.
This is why tracing matters as much as logging. A trace ties together every step of handling one request into a single linked timeline, so you can follow the request from input to output and see where it went off course. Was the wrong document retrieved? Then the model reasoned correctly over bad context, and the fix is in retrieval, not the prompt. Did retrieval succeed but the model ignore the context? A different fix entirely. Without an end-to-end trace, multi-step systems are nearly impossible to debug, because you cannot tell which link in the chain failed.
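The idea of a linked timeline can be sketched in a few lines. The example below is a toy, hand-rolled tracer, not a real tracing library: each step runs inside a span that shares one trace ID and records its own attributes, status, and duration. The `Trace` class and the `retrieve` and `call_model` stand-ins are hypothetical names introduced only for illustration.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects the steps of one request under a shared trace_id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        entry = {"trace_id": self.trace_id, "name": name,
                 "attributes": dict(attributes), "status": "ok"}
        start = time.perf_counter()
        try:
            yield entry  # the step can attach its outputs to entry["attributes"]
        except Exception as exc:
            entry["status"] = "error"
            entry["attributes"]["error"] = repr(exc)
            raise
        finally:
            # Record the span whether the step succeeded or failed.
            entry["duration_s"] = time.perf_counter() - start
            self.spans.append(entry)


# Stand-ins for real retrieval and model calls.
def retrieve(query):
    return ["doc-17"]


def call_model(prompt):
    return "stub answer"


trace = Trace()
with trace.span("retrieve", query="refund policy") as s:
    docs = retrieve("refund policy")
    s["attributes"]["doc_ids"] = docs
with trace.span("generate", model="example-model") as s:
    answer = call_model("prompt assembled from docs")
    s["attributes"]["response"] = answer

for s in trace.spans:
    print(s["name"], s["status"], s["attributes"])
```

Reading the spans in order shows which documents retrieval returned before the model ever ran, which is exactly the evidence needed to decide whether the fix belongs in retrieval or in the prompt.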
Metrics worth watching over time
Beyond per-interaction records, a handful of aggregate signals tell you whether the system is healthy in production.
- Latency, including time to first token. For streaming experiences, how long until output begins is often what users actually feel, distinct from total completion time.
- Token usage. Tokens consumed per request map directly to cost, and they tend to creep upward as prompts and context grow. Tracking them catches expensive drift early.
- Error and refusal rates. Track both hard failures and the softer pattern of the model declining or hedging. A rising refusal rate often signals a prompt or policy problem worth investigating.
- Throughput and volume. Baseline traffic so anomalies stand out against it.
These do not tell you whether answers are good — only whether the system is behaving normally. They are the early-warning layer: when a metric moves, you go to the per-interaction logs to find out why. Track them as trends, not single points, since a number is only meaningful against its own history.
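Two of these signals are easy to sketch. The Python below measures time to first token on a streamed response and compares recent token usage against its own history. The fake stream, the 25% tolerance, and the token counts are illustrative assumptions, not recommended thresholds or real data.

```python
import time
from statistics import mean


def measure_stream(chunks):
    """Consume a streamed response; return text, time to first token, total time."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # moment output began
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first, total


def drifted(history, recent, tolerance=0.25):
    """True if the recent mean exceeds the historical mean by more than tolerance."""
    baseline = mean(history)
    return mean(recent) > baseline * (1 + tolerance)


def fake_stream():
    # Stand-in for a provider's streaming iterator.
    for piece in ["Refunds ", "take ", "5 days."]:
        time.sleep(0.01)
        yield piece


text, ttft, total = measure_stream(fake_stream())
print(f"ttft={ttft:.3f}s total={total:.3f}s")

tokens_last_month = [900, 950, 1000, 980]  # illustrative numbers, not real data
tokens_this_week = [1300, 1350, 1400]
print(drifted(tokens_last_month, tokens_this_week))
```

Comparing against the system's own baseline, rather than a fixed number, is what turns a raw count into a trend signal.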
Judging quality, not just activity
The hardest thing to observe is whether outputs are actually good, because there is usually no single correct answer to compare against. A few practical approaches make quality at least partly visible.
Capture user signals when they exist — explicit thumbs up and down, or implicit signals like a user retrying, abandoning, or editing the output. These are noisy but real, and they point you toward the interactions worth examining. Keep a curated set of representative inputs and run new prompt or model versions against it before rollout, so you can compare quality deliberately rather than discovering a regression in production. And sample real interactions for human review on a regular cadence; a small, routine read of actual outputs catches drift that no metric surfaces. The point is not to automate judgment fully but to stop flying blind.
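The curated-set approach can be sketched as a small comparison harness. In the example below, the cases, the substring checks, and the two stand-in "versions" are all hypothetical. Substring matching is a deliberately crude proxy for quality, used here only to show the shape of the comparison.

```python
from typing import Callable

# Curated cases: an input plus a cheap, explicit check on the output.
# Both the cases and the checks are illustrative.
CASES = [
    {"input": "How long do refunds take?", "must_contain": "5 days"},
    {"input": "Can I return opened items?", "must_contain": "30 days"},
]


def pass_rate(generate: Callable[[str], str], cases: list) -> float:
    """Fraction of curated cases whose output contains the required phrase."""
    passed = sum(
        1 for case in cases
        if case["must_contain"].lower() in generate(case["input"]).lower()
    )
    return passed / len(cases)


# Stand-ins for "current prompt + model" and "candidate prompt + model".
def current_version(question: str) -> str:
    answers = {
        "How long do refunds take?": "Refunds take 5 days.",
        "Can I return opened items?": "Yes, within 30 days.",
    }
    return answers[question]


def candidate_version(question: str) -> str:
    answers = {
        "How long do refunds take?": "Refunds are processed quickly.",
        "Can I return opened items?": "Yes, within 30 days.",
    }
    return answers[question]


baseline = pass_rate(current_version, CASES)
candidate = pass_rate(candidate_version, CASES)
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
if candidate < baseline:
    print("candidate regresses on the curated set; review before rollout")
```

A harness like this does not replace human review. It only makes a regression visible before rollout instead of after.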
Logging responsibly
Capturing full prompts and responses means capturing whatever users typed — which can include personal or sensitive information. Observability cannot become a quiet data-retention problem.
Build privacy in from the start. Know what is being stored, redact or avoid storing sensitive fields where you can, set retention limits so logs do not accumulate forever, and control who can read them. There is a genuine tension here: the richer your logs, the better your debugging and the larger your exposure if those logs leak. Resolve it deliberately rather than by accident. The provider documentation from OpenAI and Anthropic describes their own data handling, which is part of the picture, but what you store is your responsibility to govern.
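As a rough sketch of redaction and retention, the Python below masks email addresses and phone-like numbers before text is stored, and drops records older than a retention window. The two regular expressions and the 30-day window are illustrative assumptions. Real detection of personal data needs much broader coverage than this.

```python
import re
import time

# Illustrative patterns only; real PII detection needs far more than two regexes.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

RETENTION_SECONDS = 30 * 24 * 3600  # hypothetical 30-day retention window


def redact(text: str) -> str:
    """Replace matched emails and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def expired(record: dict, now: float) -> bool:
    """True if the record is older than the retention window."""
    return now - record["timestamp"] > RETENTION_SECONDS


def purge(records: list, now: float) -> list:
    """Keep only records still inside the retention window."""
    return [r for r in records if not expired(r, now)]


raw = "Contact me at jane.doe@example.com or +1 (555) 010-9999 about my order."
print(redact(raw))

now = time.time()
records = [
    {"timestamp": now - 40 * 24 * 3600, "prompt": "old"},
    {"timestamp": now - 1 * 24 * 3600, "prompt": "recent"},
]
print([r["prompt"] for r in purge(records, now)])
```

Redacting before the write, rather than cleaning up afterward, means the sensitive text never lands in the log store in the first place.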
Start small and grow
You do not need a full observability platform on day one, and pretending otherwise is a good way to never start. Begin by logging the core record (resolved prompt, response, model and parameters, and prompt version) for every interaction. That single step makes most early debugging possible. Add tracing when your system grows past a single call. Add aggregate metrics when you have enough traffic for trends to mean something. Add quality evaluation when you are tuning prompts and models in earnest. Each layer earns its place; build them in the order your application actually grows, not all at once.
The takeaway
LLM apps fail in a way traditional monitoring cannot see — a successful call that returns a wrong answer — so observability has to look inside each interaction, not just at the infrastructure around it. Log the full resolved prompt and response with the model and prompt version, trace multi-step requests end to end, watch latency and token usage as trends, and find a way to sample quality rather than assuming it. Do it with privacy designed in from the start, and grow the layers as your system grows. The reward is that "it gave a bad answer" stops being a shrug and becomes a question you can answer.
