The modern AI app stack, end to end
A clear map of the layers that make up a real AI application — model, orchestration, retrieval, evaluation, and the unglamorous glue that holds it together.
When people picture an AI application, they picture the model. But the model is only one layer of several, and most of what determines whether an AI feature actually works lives in the unglamorous layers around it. Understanding the full stack — what each piece does and why it exists — turns "AI is unpredictable magic" into a set of components you can reason about, debug, and improve. This is an end-to-end map, from the user's input to the model and back, with honest notes on where each layer earns its keep.
The model layer: capability, not application
At the center sits the model itself, hosted by a provider or run on your own infrastructure. This is the raw capability: text in, text out, with some reasoning in between. It is essential, and it is also the layer you have the least control over, because you are mostly choosing among options rather than building one. The mistake is treating this choice as the whole project. A capable model wired into a poor application underperforms a modest model inside a well-built one.
Practically, this layer is a decision with trade-offs: hosted versus self-run, larger versus smaller, general versus specialized. Each has cost, latency, privacy, and capability implications. The right call depends on your task, and — importantly — it is replaceable. Designing so you can swap the model later, as options improve or your needs change, is one of the highest-leverage architectural decisions you can make early.
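One way to keep the model replaceable is to have application code depend on a small interface rather than on a specific provider. The sketch below is illustrative only: TextModel, EchoModel, UppercaseModel, and summarize are hypothetical names, and the two stand-in classes take the place of real adapters that would wrap a provider SDK or a self-run model.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in for one model; a real adapter would call a provider SDK."""

    prefix: str = "echo"

    def complete(self, prompt: str) -> str:
        return f"{self.prefix}: {prompt}"


@dataclass
class UppercaseModel:
    """Stand-in for a second model with different behavior."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize(model: TextModel, text: str) -> str:
    # Application code depends only on the interface, so the model
    # behind it can change without touching this function.
    return model.complete(f"Summarize: {text}")
```

Swapping models then means passing a different object to summarize, and the rest of the application does not need to know which one it received.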
The orchestration layer: turning calls into behavior
A single model call is rarely an application. Real features chain calls, branch on results, retry on failure, call tools, and assemble prompts from multiple sources. The orchestration layer is the code that coordinates all of this — deciding what to send, in what order, and what to do with each response. This is where your application's actual logic lives, and where most of the engineering effort goes.
Orchestration is also where complexity quietly accumulates. Each added step — a retrieval call, a tool invocation, a second model pass to check the first — adds latency, cost, and a new way to fail. The discipline here is to add steps only when they earn their place, and to keep the flow simple enough that you can still reason about what happened when something goes wrong. A pipeline you cannot trace is a pipeline you cannot fix.
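A traceable flow can be quite small. The following is a minimal sketch, assuming each step is a plain function from string to string; Trace, run_step, and run_pipeline are hypothetical names, and a production system would likely add timeouts, backoff, and structured logging.

```python
from dataclasses import dataclass, field
from typing import Callable

Step = Callable[[str], str]


@dataclass
class Trace:
    """Ordered record of what each step did, for later debugging."""

    events: list[tuple[str, str]] = field(default_factory=list)

    def record(self, step: str, detail: str) -> None:
        self.events.append((step, detail))


def run_step(name: str, fn: Step, value: str, trace: Trace,
             max_attempts: int = 2) -> str:
    """Run one step, retrying on failure, and record every attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn(value)
        except Exception as exc:
            trace.record(name, f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise  # out of retries: surface the original error
        else:
            trace.record(name, f"attempt {attempt} ok")
            return result
    raise ValueError("max_attempts must be at least 1")


def run_pipeline(steps: list[tuple[str, Step]], user_input: str,
                 trace: Trace) -> str:
    """Feed the output of each step into the next, in order."""
    value = user_input
    for name, fn in steps:
        value = run_step(name, fn, value, trace)
    return value
```

Because every attempt lands in the trace, a failed request can be reconstructed step by step, and each extra step in the list is a visible, deliberate cost.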
The context layer: prompts, retrieval, and memory
Beyond what it absorbed in training, a model knows only what is in front of it. The context layer is everything that assembles what the model sees: the prompt that frames the task, retrieved documents that ground the answer in your data, and any memory of prior turns in a conversation. This is frequently the layer that decides quality, because the same model produces wildly different results depending on what context it is given.
Retrieval belongs here. When an application answers using your own documents, this layer embeds the query, finds relevant passages, and folds them into the prompt. Memory belongs here too — deciding what from earlier in a conversation is worth carrying forward and what to drop. Done well, this layer makes a general model feel like it knows your specific world. Done poorly, it feeds the model irrelevant or contradictory material and you blame the model for the result.
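The embed, find, and fold steps can be sketched in a few lines. This is a toy illustration, not a recommended design: the embedding here is just a bag of word counts standing in for a learned embedding model, and embed, cosine, retrieve, and build_prompt are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption for illustration)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return up to k passages, most similar first, dropping zero-score ones."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    relevant = [(s, p) for s, p in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in relevant[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Fold the retrieved passages into the prompt the model will see."""
    context = "\n".join(f"- {p}" for p in passages)
    if not context:
        context = "- (no relevant passages found)"
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Dropping zero-score passages is one small guard against the failure described above, where irrelevant material is fed to the model and the model gets blamed for the result.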
The tools layer: letting the model act
A model that can only produce text is limited; many useful applications need the model to do things — look up live data, run a calculation, query a database, call an external service. The tools layer defines the actions available to the model and safely executes them when the model asks. The model decides what to do; your code decides whether and how it actually happens.
The critical word is safely. Tools are where an AI application touches the real world, which means they are where mistakes have consequences beyond a bad sentence. This layer needs guardrails: validating what the model requests, limiting what it can reach, and confirming irreversible actions. Treat the model's tool requests as untrusted input, because in effect they are — the model is suggesting an action, not authorizing it, and your code holds the authority.
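Treating tool requests as untrusted input might look like the sketch below. The request shape (a dict with "name" and "arguments"), the Tool record, the two example tools, and the confirmed flag are all assumptions made for illustration; real systems define their own formats and confirmation flows.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    func: Callable[..., Any]
    params: dict[str, type]      # allowed argument names and their types
    irreversible: bool = False   # if True, explicit confirmation is required


def get_balance(account_id: str) -> int:
    return {"acct-1": 120}.get(account_id, 0)  # fake data store


def delete_account(account_id: str) -> str:
    return f"deleted {account_id}"


# Only tools listed here can ever run, whatever the model asks for.
REGISTRY = {
    "get_balance": Tool(get_balance, {"account_id": str}),
    "delete_account": Tool(delete_account, {"account_id": str}, irreversible=True),
}


def execute(request: dict[str, Any], confirmed: bool = False) -> Any:
    """Validate a model-proposed tool call, then run it if it passes."""
    name = request.get("name")
    tool = REGISTRY.get(name) if isinstance(name, str) else None
    if tool is None:
        raise PermissionError(f"unknown tool: {name!r}")
    args = request.get("arguments", {})
    if not isinstance(args, dict) or set(args) != set(tool.params):
        raise ValueError("arguments do not match the tool's declared parameters")
    for key, expected in tool.params.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    if tool.irreversible and not confirmed:
        raise PermissionError("irreversible action requires confirmation")
    return tool.func(**args)
```

The model only ever produces the request dict. Whether anything happens is decided entirely inside execute, which is where the authority sits.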
The evaluation layer: knowing if it works
Traditional software tends to fail loudly: it crashes or throws an error. AI applications fail more subtly. They produce output that is plausible but wrong, and they degrade quietly as you change prompts or swap models. The evaluation layer is how you know whether the system actually does its job: a set of representative test cases and a way to measure quality against them, run continuously rather than checked once by eyeballing a few outputs.
This layer is the one teams most often skip and most often regret skipping. Without it, every change is a gamble: you improve one case and silently break three others with no way to notice. With even a modest evaluation set, you can change the model, tune a prompt, or add a retrieval step and measure whether you helped or hurt. Evaluation is what turns AI development from guesswork into engineering, and it is worth building before you think you need it.
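Even a modest evaluation set can be a few dozen lines. The sketch below assumes the system under test is a function from prompt to text and scores each case with a simple substring check; Case and evaluate are hypothetical names, and real evaluations often need richer scoring than this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    must_contain: str  # simplistic pass criterion, chosen for illustration


def evaluate(system: Callable[[str], str],
             cases: list[Case]) -> tuple[float, list[Case]]:
    """Return the pass rate and the list of failing cases."""
    failures = [
        case for case in cases
        if case.must_contain.lower() not in system(case.prompt).lower()
    ]
    if not cases:
        return 0.0, failures
    return (len(cases) - len(failures)) / len(cases), failures


CASES = [
    Case("capital of France?", "Paris"),
    Case("2 + 2?", "4"),
]


def baseline(prompt: str) -> str:
    """Stand-in for the current system."""
    return {"capital of France?": "Paris", "2 + 2?": "4"}.get(prompt, "unknown")


def candidate(prompt: str) -> str:
    """Stand-in for a changed system that silently regressed on one case."""
    return {"capital of France?": "It is Paris.", "2 + 2?": "5"}.get(prompt, "unknown")
```

Running evaluate on both baseline and candidate before shipping a change is what surfaces the silent regression: the list of failing cases says exactly which behavior broke.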
The observability and cost layer: running it in the open
Once an AI feature is live, you need to see what it is doing. The observability layer logs the prompts, the retrieved context, the model responses, the tool calls, the latencies, and the costs. When a user reports a bad answer, this is how you reconstruct what actually happened instead of guessing. AI systems are non-deterministic, which makes good logging more important than in ordinary software, not less.
Cost lives alongside observability because in AI applications cost is a runtime property, not a fixed line item. Every call consumes tokens, every retrieval adds context, every extra orchestration step multiplies usage. Without visibility, costs drift upward unnoticed until a bill or a rate limit forces the conversation. Watching cost as a first-class signal — per request, per feature, over time — keeps the economics from quietly breaking the product.
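Watching cost per request and per feature can start from the same records used for observability. In the sketch below, CallRecord, call_cost, and cost_by_feature are hypothetical names, and the prices are invented placeholders; real prices vary by provider and model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One logged model call (a real log would also keep prompt and response)."""

    feature: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


# Hypothetical prices in dollars per 1,000 tokens, for illustration only.
PRICE_PER_1K_INPUT = 0.50
PRICE_PER_1K_OUTPUT = 1.50


def call_cost(record: CallRecord) -> float:
    """Dollar cost of a single call under the assumed prices."""
    return (record.input_tokens * PRICE_PER_1K_INPUT
            + record.output_tokens * PRICE_PER_1K_OUTPUT) / 1000


def cost_by_feature(records: list[CallRecord]) -> dict[str, float]:
    """Total cost per feature, so drift shows up where it originates."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += call_cost(record)
    return dict(totals)
```

Aggregating the same records by day or by user gives the over-time view, and an added orchestration step shows up as a step change in the per-request figure.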
How the layers fit together
Trace a single request and the stack becomes concrete. The user's input arrives, the orchestration layer takes over, the context layer assembles a prompt with any retrieved documents, the model produces a response, the tools layer executes any actions the model requested and hands the results back, and the final answer returns to the user. Throughout, the observability layer records the whole journey, and the evaluation layer, offline, keeps checking that journeys like it still go well. Each layer is replaceable, and weakness in any one caps the quality of the whole.
The strategic insight is that these layers fail and improve independently. A disappointing application usually has one weak layer, not a fundamentally weak design. Diagnosing which layer — bad retrieval, sloppy orchestration, an unsafe tool, no evaluation — is what separates teams that steadily improve their AI features from teams that keep swapping models hoping the next one fixes everything.
The takeaway
A modern AI application is a stack, not a model. The model supplies capability, but orchestration coordinates it, the context layer feeds it, the tools layer lets it act, evaluation tells you if it works, and observability lets you run it in daylight. Most quality and most cost live in the layers around the model, not in the model itself. Build with that map in mind and AI development stops feeling like magic and starts feeling like what it is: ordinary engineering with one unusually unpredictable component at the center.
