Putting an LLM in customer support: what breaks first

A support chatbot is the easiest AI demo and one of the hardest things to run well. Here is where real deployments break — and what separates the ones that survive.

use-cases · 2026-04-02 12:31 KST · Lead Editor · 7 min read

Customer support is the most popular first use case for AI in a business, and for good reason: the demo is irresistible. You connect a model to your help docs, ask it a question, and it answers fluently. The gap between that demo and a system you can actually run is where most projects struggle. This piece walks through what breaks first, in roughly the order teams hit it, so you can plan for the failures instead of discovering them in production.

The demo is easy; the operation is hard

The demo works because it shows the easy path: a clear question with a clean answer that lives in the docs. Real support traffic looks different: ambiguous questions, missing information, edge cases, angry customers, and requests that require an action rather than an answer. The model that aced the demo meets all of this on day one. Planning for the demo means planning for the small share of traffic that was never the problem.

Break #1: It answers when it does not know

The first thing to break is calibration. A support model will answer questions whose answers are not in your documentation, inventing plausible policies, prices, or steps. Customers cannot tell a grounded answer from a fabricated one, because both sound equally fluent. A single confidently wrong answer about a refund policy can cost more than the whole system saves.

The fix is grounding and honesty: retrieve from your actual documentation, instruct the model to answer only from what it retrieved, and have it say clearly when it does not know and hand off. The hard part is not the instruction. It is accepting that "I don't know, let me connect you to someone" is a good outcome, not a failure.
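A minimal sketch of that grounding rule, in Python. Everything named here is an assumption for illustration: the `Passage` shape, the `MIN_SCORE` threshold, the `HANDOFF` sentinel, and the prompt wording. The idea is only that the "I don't know" path is decided in code, before and after the model call, rather than left to the model's mood.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # retrieval similarity, assumed to be normalized to [0, 1]


MIN_SCORE = 0.6            # hypothetical cutoff; tune it on real questions
HANDOFF_TOKEN = "HANDOFF"  # hypothetical sentinel the model is told to emit


def build_prompt(question: str, passages: List[Passage]) -> Optional[str]:
    """Return a grounded prompt, or None when nothing relevant was retrieved."""
    relevant = [p for p in passages if p.score >= MIN_SCORE]
    if not relevant:
        return None  # skip the model entirely and route to a human
    context = "\n\n".join(f"[{p.doc_id}] {p.text}" for p in relevant)
    return (
        "Answer using only the help-center excerpts below. "
        f"If they do not contain the answer, reply exactly: {HANDOFF_TOKEN}\n\n"
        f"Excerpts:\n{context}\n\nCustomer question: {question}"
    )


def is_handoff(model_reply: str) -> bool:
    """True when the model reported that the excerpts do not cover the question."""
    return model_reply.strip() == HANDOFF_TOKEN
```

A caller would hand off when `build_prompt` returns `None` or when `is_handoff` is true for the reply, and count both as good outcomes rather than errors.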

Break #2: Retrieval misses the relevant document

Once you ground answers in your docs, the next failure moves upstream: the model answers from the wrong document, or from no document at all, because retrieval missed the relevant one. Every retrieval-backed system learns the same lesson: most wrong answers trace back to retrieval, not generation. If the right help article is not in front of the model, no amount of fluent generation will produce the right answer.

This is where the unglamorous work pays off: keeping the knowledge base clean and current, chunking articles sensibly, and measuring whether the right article is actually being retrieved for real questions — separately from whether the final answer reads well.
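One way to measure retrieval on its own is a hit rate at k: for a set of real questions labeled with the article that should answer them, how often does that article appear in the top k results? The sketch below assumes you already have both mappings; the function name and data shapes are illustrative, not taken from any particular library.

```python
from typing import Dict, List


def hit_rate_at_k(
    retrieved: Dict[str, List[str]],  # question -> ranked article IDs from the retriever
    expected: Dict[str, str],         # question -> article ID a human marked as correct
    k: int = 3,
) -> float:
    """Fraction of labeled questions whose correct article is in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for question, article_id in expected.items()
        if article_id in retrieved.get(question, [])[:k]
    )
    return hits / len(expected)
```

Tracking this number separately from answer quality tells you whether a bad answer came from the retriever or from the generation step.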

Break #3: The knowledge base is wrong or stale

A support model is only as good as the documents behind it. Most companies discover, when they wire up a model, that their help center contradicts itself, describes features that changed, or never documented the thing customers ask about most. The AI does not create this problem; it surfaces it, at scale, in front of customers. Teams that succeed treat the documentation cleanup as part of the project, not a prerequisite someone else will handle.

Break #4: It cannot take actions safely

Answering questions is one thing; doing something — issuing a refund, changing an address, cancelling an order — is another. The moment a support assistant can take actions, the stakes change. A wrong answer is embarrassing; a wrong action moves money or data. Real deployments draw a careful line: low-risk actions the model can take directly, higher-risk ones that require confirmation or a human, and a clear audit trail for all of them. This kind of context-aware risk management is exactly what frameworks like the NIST AI Risk Management Framework encourage — match the controls to the consequences.
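The line between direct and gated actions can be sketched as a small policy table. The action names, the two risk tiers, and the in-memory audit log below are all hypothetical; a real system would persist the log and likely use finer-grained tiers. Note the one deliberate choice: an action missing from the table is treated as high risk.

```python
import json
import time
from enum import Enum
from typing import List


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical action names and tiers, for illustration only.
ACTION_RISK = {
    "resend_receipt": Risk.LOW,
    "change_address": Risk.HIGH,
    "issue_refund": Risk.HIGH,
}

AUDIT_LOG: List[str] = []  # stand-in for durable, append-only storage


def decide(action: str, confirmed_by_human: bool = False) -> str:
    """Return 'execute' or 'needs_confirmation', and record the decision."""
    risk = ACTION_RISK.get(action, Risk.HIGH)  # unknown actions default to high risk
    if risk is Risk.LOW or confirmed_by_human:
        decision = "execute"
    else:
        decision = "needs_confirmation"
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(),
        "action": action,
        "risk": risk.value,
        "confirmed_by_human": confirmed_by_human,
        "decision": decision,
    }))
    return decision
```

Every call is logged, including the ones that were blocked, so the audit trail covers what the assistant attempted as well as what it did.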

Break #5: The handoff is clumsy

No support AI handles everything, so the handoff to a human is part of the product, not an afterthought. Where it breaks: the model hands off without passing context, so the customer repeats themselves and gets angrier; or it refuses to hand off and traps the customer in a loop. A good handoff is seamless — it carries the conversation, the customer's intent, and what was already tried, so the human starts informed.
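Both failure modes can be addressed in a few lines: a rule that forces the handoff, and a payload that carries the context. This is a sketch under assumed names; the `Handoff` fields, the `MAX_FAILED_TURNS` limit, and the summary format are illustrative rather than any ticketing system's schema.

```python
from dataclasses import dataclass, field
from typing import List

MAX_FAILED_TURNS = 2  # hypothetical limit before the assistant must escalate


@dataclass
class Handoff:
    conversation_id: str
    customer_intent: str
    transcript: List[str]
    steps_tried: List[str] = field(default_factory=list)
    reason: str = "model_could_not_answer"


def should_hand_off(failed_turns: int, asked_for_human: bool) -> bool:
    """Escalate on explicit request or after repeated failures, so no one is trapped in a loop."""
    return asked_for_human or failed_turns >= MAX_FAILED_TURNS


def summary_for_agent(h: Handoff) -> str:
    """Short briefing so the human agent does not have to ask the customer to repeat themselves."""
    tried = "; ".join(h.steps_tried) or "nothing yet"
    return (
        f"Intent: {h.customer_intent}\n"
        f"Already tried: {tried}\n"
        f"Reason for handoff: {h.reason}\n"
        f"Turns so far: {len(h.transcript)}"
    )
```

The full transcript travels with the payload; the summary is only what the agent reads first.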

Break #6: Nobody is measuring the right thing

The final failure is quiet. Teams measure deflection rate (how many tickets the AI handled) and celebrate, without measuring whether those answers were correct or whether customers came back angrier. Deflection without quality is just hiding problems. The deployments that last measure resolution quality and customer outcomes, read real transcripts regularly, and treat the worst answers as the signal — not the average.
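The gap between volume and quality is easy to see once both are computed from the same tickets. The sketch below assumes each ticket carries a correctness label from human transcript review and a flag for whether the customer came back; the `Ticket` fields and metric names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    handled_by_ai: bool
    answer_correct: bool  # assumed to come from human review of the transcript
    reopened: bool        # customer came back about the same issue


def deflection_rate(tickets: List[Ticket]) -> float:
    """Share of tickets the AI closed without a human, regardless of quality."""
    if not tickets:
        return 0.0
    return sum(t.handled_by_ai for t in tickets) / len(tickets)


def verified_resolution_rate(tickets: List[Ticket]) -> float:
    """Share of tickets the AI closed correctly and that stayed closed."""
    if not tickets:
        return 0.0
    resolved = sum(
        t.handled_by_ai and t.answer_correct and not t.reopened for t in tickets
    )
    return resolved / len(tickets)
```

When the first number is high and the second is low, the system is hiding problems rather than solving them, and the tickets that make up the difference are the transcripts worth reading first.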

What separates the survivors

The pattern across deployments that work: they treat the model as one component in a system, not the system. They ground answers, design the "I don't know" path deliberately, keep the knowledge base honest, gate risky actions, make handoffs graceful, and measure quality rather than just volume. None of this is exotic. It is the difference between shipping the demo and running the operation.

The takeaway

A support chatbot is the easiest AI demo and one of the hardest things to run well. It breaks, roughly in this order, on overconfidence, retrieval, stale documents, unsafe actions, clumsy handoffs, and the wrong metrics. Every one of those is foreseeable. Plan for them up front, and an AI support layer becomes a genuine asset. Skip the planning and ship the demo, and your customers will find the failures for you.

#customer-support #deployment #rag #operations

Primary sources

NIST AI Risk Management Framework