Guardrails: filtering inputs and outputs around an LLM

A model alone is not a safe product. Guardrails are the input and output filters that keep an LLM inside the boundaries you actually need.

tools2026-06-16 12:31 KST·Lead Editor·7 min read

A language model will cheerfully do things you never wanted it to do. It will answer a question outside its remit, repeat a user's hostile framing, leak instructions it was told to keep private, or produce output your downstream code cannot parse. None of this means the model is broken. It means a raw model is a capability, not a product. Guardrails are the layer that turns the first into the second: the checks you run on what goes in and what comes out, so the system stays inside boundaries you can defend.

What a guardrail actually is

Strip away the marketing and a guardrail is a filter with a decision attached. Something — text, a request, a generated answer — passes through a check, and the check either allows it, blocks it, or transforms it. That check can be a simple rule, a classifier, or another model call. The important part is the decision: a guardrail that detects a problem but does nothing is just logging.

Guardrails come in two families, and the distinction is the most useful one to hold in your head:

Input guardrails sit between the user and the model. They inspect the request before it reaches generation.
Output guardrails sit between the model and the user (or the next system). They inspect what the model produced before anyone acts on it.

Most real systems need both, because the two families catch different failures. Filtering the input does not guarantee a safe output, and a clean-looking output can come from a request that should never have been served.

Filtering the input

Input guardrails answer one question: should we even run this through the model? Several checks earn their place here.

The first is moderation — screening for content you have decided not to engage with at all. This is the clearest case for a dedicated classifier rather than hand-written rules, because the categories are fuzzy and adversarial users probe the edges constantly.

The second is prompt-injection awareness. When your application pulls in text from outside — a web page, an uploaded document, an email — that text may contain instructions aimed at your model rather than content for it to process. An input guardrail cannot fully solve injection, but it can flag suspicious patterns and, crucially, it reminds you to keep untrusted input clearly separated from your own instructions.

The third is scope. Many products are meant to do one thing. A support assistant for a banking app has no business writing poetry or giving medical advice, not because those are harmful but because they are off-mission and erode trust. A lightweight topical check keeps the system honest about what it is for.

The discipline to adopt early: never trust input because it came from your own UI. The text field is the front door, and the front door is where the trouble walks in.

Filtering the output

Output guardrails are where most teams under-invest, and they are often the more important half. The model has now generated something, and before it reaches a human or triggers an action, you get one more chance to catch a problem.

Useful output checks include:

Safety and policy. Did the model produce content that violates your stated policy, regardless of how it was prompted? This is your backstop for the injection and jailbreak attempts the input layer missed.
Format and structure. If your code expects a specific shape — a particular set of fields, a category from a fixed list — verify it before parsing. A model that returns prose where you expected structured data should be caught here, not by a crash three functions downstream.
Grounding. For systems that answer from a knowledge source, check whether the answer is actually supported by the retrieved material rather than invented. This is harder than the other checks and rarely perfect, but even a coarse version catches confident fabrication.
Leakage. Did the response reveal system instructions, internal identifiers, or another user's data? A simple check against known sensitive strings is cheap and worth running.

The reason output guardrails matter more than they seem is that they are the layer closest to consequences. By the time something reaches the output check, all the upstream cleverness has already happened. This is the last honest gate.

Choosing how strict to be

Every guardrail has two ways to fail. It can let through something it should have blocked (a miss), or it can block something perfectly fine (a false alarm). You cannot drive both to zero, and pretending otherwise produces a system that is either unsafe or unusable.

The right balance depends entirely on stakes. A guardrail in front of an action that moves money or sends an email to a customer should lean strict and fail closed — when in doubt, block and escalate. A guardrail on a brainstorming assistant can lean permissive, because the cost of a false alarm (a frustrated user) outweighs the cost of an occasional miss (a slightly off answer the user simply ignores). Decide where each guardrail sits on this spectrum before you tune it, because the question "how strict?" has no answer without knowing what is on the other side of the gate.

Building it without slowing everything down

Guardrails add work, and work adds latency and cost. A few patterns keep that manageable.

Run the cheap checks first. Simple rule-based filters and string matches cost almost nothing; run them before any check that requires a model call, and let an early block short-circuit the rest. Reserve the expensive classifier-and-model checks for input that has already passed the cheap gates.

Run independent checks in parallel where you can, so total latency is the slowest check rather than the sum of all of them. And separate blocking checks from monitoring ones: a check that must pass before the user sees a response has to run inline, but a check you only want for analytics can run after the fact, off the critical path.

Finally, log every guardrail decision — what fired, on what, and what happened. Guardrails are only as good as your ability to see where they are too loose or too tight, and that visibility comes entirely from the logs.

What guardrails cannot do

It is worth being honest about the ceiling. Guardrails reduce risk; they do not eliminate it. A determined adversary will find phrasings your filters miss, and a sufficiently capable model will occasionally produce something surprising that no check anticipated. Guardrails are also not a substitute for the model being well-behaved in the first place — they are a defense layer, not the only defense.

Treat them as one part of a larger posture that includes a model chosen and prompted for the task, least-privilege access for any actions the model can trigger, human review where the stakes demand it, and an incident plan for when something slips through. A guardrail layer that lets you sleep at night is the product of all of these together, not a single clever filter.

The takeaway

Guardrails are the difference between a model and a product you can put in front of real users. Filter the input to decide what is worth running, filter the output to decide what is safe to act on, and tune each gate's strictness to what sits behind it. Keep the cheap checks first and the expensive ones rare, log every decision, and stay humble about the ceiling. A model gives you capability; guardrails give you the boundaries that make that capability safe to ship.

#guardrails#safety#llm-ops#moderation

Primary sources

OpenAI API documentation Anthropic documentation