Stream and render model output in a UI
Why streaming makes AI features feel fast, and how to render token-by-token output in a UI without flicker, broken markup, or layout chaos.
The difference between an AI feature that feels fast and one that feels broken is often not the model — it is whether you stream the output. A model that takes several seconds to produce a full answer feels painfully slow if the user stares at a spinner the whole time. The same model feels responsive if the first words appear almost immediately and the rest flows in. Streaming is the technique that turns a long wait into a live response, and rendering it well is a small craft worth learning.
Why streaming changes the experience
When you make an ordinary request, you wait for the entire response before you see anything. For a long answer, that is a long stare at nothing. Streaming changes the shape of the wait: instead of one big delay at the end, the model sends its output incrementally as it generates, and you display each piece as it arrives.
The total time to finish is roughly the same. What changes is the perceived speed. Time-to-first-token, meaning how long until something appears, can drop to a fraction of a second, and people perceive a streaming response as fast even when the full generation is slow. This is the same psychology as a progress bar versus a frozen screen. The work is identical; the experience is not. For anything a person waits on, streaming is close to mandatory.
How streaming works at a high level
Under the hood, a streaming response is a long-lived connection that delivers a sequence of small events rather than one final payload. Each event carries a chunk of the output — often a few characters or a token. Your code reads these events as they come, appends each chunk to whatever you have accumulated so far, and updates the display. When the stream signals completion, you have the full response, assembled piece by piece.
In pseudocode the loop is simple:
accumulated = ""
for chunk in stream(request):
    accumulated += chunk.text   # append the newest delta
    render(accumulated)         # redraw the full text so far
# the loop exits when the stream signals completion
finalize(accumulated)
The provider's SDK handles the transport details. Your job is the loop: read chunks, accumulate, render, and handle the end. The interesting problems are almost all on the rendering side.
Render the accumulated text, not the deltas
The first rule of rendering a stream is to display the accumulated string, not just the latest chunk. It is tempting to append each delta straight to the DOM as raw text, but that breaks the moment you need any formatting. Markdown, code blocks, and structured output only make sense as a whole. A chunk might split a word, a markdown token, or a tag in half. If you render deltas independently you get garbled partial markup.
Instead, keep the full accumulated text in state and re-render it on each update. Modern UI frameworks make this cheap — you update one string in state, and the framework reconciles the display efficiently. Rendering the whole accumulated answer each time also means your markdown or syntax highlighting always sees complete-so-far text, which it can parse far more gracefully than isolated fragments.
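A small Python sketch makes the difference concrete. The `render_bold` function is a hypothetical stand-in for a real markdown renderer (it only understands `**bold**`), and the chunk boundaries are invented so that the stream splits a bold span in half.

```python
import re


def render_bold(text: str) -> str:
    """Toy renderer (an assumption for illustration): **x** becomes <b>x</b>."""
    return re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", text)


# Hypothetical chunks; the boundary falls in the middle of the bold span.
chunks = ["Use **str", "eaming** for long answers."]

# Fragile: render each delta on its own and concatenate the results.
# Neither fragment contains a complete **...** pair, so nothing is formatted.
per_delta = "".join(render_bold(chunk) for chunk in chunks)

# Robust: accumulate, then render the whole text so far on every update.
accumulated = ""
frames = []
for chunk in chunks:
    accumulated += chunk
    frames.append(render_bold(accumulated))

print(per_delta)    # Use **streaming** for long answers.
print(frames[-1])   # Use <b>streaming</b> for long answers.
```

The first frame still shows the raw `**str`, which is the kind of in-progress state a renderer has to tolerate, but it resolves as soon as the closing marker arrives.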
Handle partial and malformed intermediate states
While a response streams, every intermediate state is incomplete by definition. A code block may have an opening fence but no closing one yet. A markdown link may be half-typed. A list may stop mid-item. If your renderer is strict, these partial states flicker or throw.
The fix is to render tolerantly. Use a markdown parser that handles unterminated structures gracefully rather than erroring, and accept that the display will briefly show in-progress formatting that resolves as more text arrives. For code blocks specifically, it helps to detect an open fence and treat the remainder as code until it closes. The principle is to design for "this text is not finished yet" as the normal case during streaming, because it is.
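One way to apply the open-fence idea is to patch the text just before it is displayed. The sketch below is a simplification under stated assumptions: it only recognises backtick fences at the start of a line, and `close_open_fence` is a hypothetical helper name, not part of any markdown library.

```python
FENCE = "`" * 3  # a markdown code fence: three backticks


def close_open_fence(partial: str) -> str:
    """Return a display-only copy of the text with any open code block closed.

    An odd number of fence lines means a block was opened but not yet closed,
    so a temporary closing fence is appended. The stored text is not modified.
    """
    fence_lines = [
        line for line in partial.splitlines()
        if line.lstrip().startswith(FENCE)
    ]
    if len(fence_lines) % 2 == 1:
        separator = "" if partial.endswith("\n") else "\n"
        return partial + separator + FENCE
    return partial


# Mid-stream: the opening fence has arrived, the closing one has not.
in_progress = "Example:\n" + FENCE + "python\nprint('hi')"
print(close_open_fence(in_progress))
```

Calling this on every update and passing the result to the renderer keeps half-finished code styled as code; once the real closing fence streams in, the count becomes even and the helper returns the text unchanged.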
Tame layout shift and scrolling
A streaming response grows, and growth moves the page. Without care, the content below the response jumps around as new lines appear, and a user trying to read the top gets yanked down. Two habits keep this calm.
First, reserve space and avoid reflowing unrelated content. Render the streaming text in a container that grows downward without pushing the rest of the layout around unpredictably. Second, handle scrolling deliberately. A common, pleasant behavior is to keep the view pinned to the bottom while the user is already at the bottom, so they follow the live output — but to stop auto-scrolling the moment they scroll up to read something, so you do not fight them. Detect whether the user is near the bottom and only auto-scroll when they are.
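The "only auto-scroll when the user is near the bottom" rule comes down to a little arithmetic. Here is a minimal sketch of that decision as pure functions; the parameter names mirror the browser's `scrollTop`, `clientHeight`, and `scrollHeight` properties, while the function names and the 40-pixel threshold are arbitrary assumptions.

```python
def is_near_bottom(scroll_top: float, viewport_height: float,
                   content_height: float, threshold: float = 40.0) -> bool:
    """True if the viewport's bottom edge is within `threshold` px of the end."""
    distance_from_bottom = content_height - (scroll_top + viewport_height)
    return distance_from_bottom <= threshold


def next_scroll_top(scroll_top: float, viewport_height: float,
                    old_content_height: float, new_content_height: float,
                    threshold: float = 40.0) -> float:
    """Decide the scroll position after the streamed content has grown.

    The check uses the height *before* the new text was added: if the user
    was following the output, keep following; otherwise leave them alone.
    """
    if is_near_bottom(scroll_top, viewport_height, old_content_height, threshold):
        return max(0.0, new_content_height - viewport_height)
    return scroll_top


# User was at the bottom (600 + 400 == 1000), so the view follows the growth.
print(next_scroll_top(600, 400, 1000, 1200))   # 800
# User had scrolled up to read earlier text, so the position is untouched.
print(next_scroll_top(100, 400, 1000, 1200))   # 100
```

Measuring against the pre-update height is the important design choice: checking after the content grows would misread a user who was pinned to the bottom as having scrolled away.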
Also throttle your updates. Re-rendering on every single token can overwhelm the browser for fast streams. Batching updates to a sensible interval, say at most once per animation frame, keeps the UI smooth without noticeably hurting responsiveness.
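The batching logic can be sketched independently of any UI framework. `ThrottledRenderer` below is a hypothetical helper, not a library class: it keeps only the latest accumulated text and calls `render` at most once per interval. The injectable `clock` is there purely so the behaviour can be tested; in a browser the same idea is usually built on `requestAnimationFrame` instead of a time check.

```python
import time
from typing import Callable, Optional


class ThrottledRenderer:
    """Coalesce rapid updates so `render` runs at most once per `min_interval`."""

    def __init__(self, render: Callable[[str], None],
                 min_interval: float = 1 / 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render
        self._min_interval = min_interval
        self._clock = clock
        self._last_render: Optional[float] = None
        self._pending: Optional[str] = None

    def update(self, accumulated: str) -> None:
        """Record the latest text; render only if enough time has passed."""
        self._pending = accumulated
        now = self._clock()
        if self._last_render is None or now - self._last_render >= self._min_interval:
            self._flush(now)

    def finish(self) -> None:
        """Call when the stream ends so the final text is never dropped."""
        if self._pending is not None:
            self._flush(self._clock())

    def _flush(self, now: float) -> None:
        self._render(self._pending)
        self._pending = None
        self._last_render = now


# Usage: many tiny chunks arrive, far fewer renders happen.
renderer = ThrottledRenderer(render=lambda text: print(repr(text)))
text_so_far = ""
for piece in ["Hel", "lo", ", ", "wor", "ld"]:
    text_so_far += piece
    renderer.update(text_so_far)
renderer.finish()  # guarantees the complete text is shown
```

Because skipped updates are overwritten rather than queued, the renderer always draws the newest text, and `finish` covers the case where the last chunks arrived inside a throttled window.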
Show state, and handle interruption
Streaming gives you natural opportunities to communicate state. Show that generation has started before the first token, indicate clearly while it is in progress, and mark when it completes. A subtle cursor or a "stop" button while streaming tells the user the system is alive and working.
That stop button matters. Because a stream is a live connection, you can cancel it midway. Give users a way to interrupt a long or wrong response rather than forcing them to wait it out — cancel the request, keep whatever text arrived so far, and return control. On the error side, a stream can fail partway through; treat a broken connection as a recoverable state, preserve the partial output, and offer a retry rather than discarding everything. Designing for cancel and partial-failure from the start is far easier than retrofitting it.
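Treating cancel and partial failure as ordinary outcomes can be as simple as returning a status alongside the text. The following is a minimal sketch under assumptions: `StreamResult` and `consume` are hypothetical names, `ConnectionError` stands in for whatever exception a real SDK raises on a dropped stream, and a real client would also abort the underlying request when the stop button is pressed.

```python
import threading
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamResult:
    status: str  # "complete", "cancelled", or "failed"
    text: str    # whatever arrived before the stream ended


def consume(chunks: Iterable[str], cancel: threading.Event) -> StreamResult:
    """Accumulate chunks, preserving partial output on cancel or failure."""
    accumulated = ""
    try:
        for chunk in chunks:
            accumulated += chunk
            if cancel.is_set():  # e.g. set by the stop button's handler
                return StreamResult("cancelled", accumulated)
    except ConnectionError:
        # The connection dropped mid-stream: keep the text and let the UI offer a retry.
        return StreamResult("failed", accumulated)
    return StreamResult("complete", accumulated)


def flaky_stream() -> Iterator[str]:
    """Simulated stream (illustration only) that breaks after two chunks."""
    yield "Hel"
    yield "lo"
    raise ConnectionError("connection dropped")


stop = threading.Event()


def stopped_stream() -> Iterator[str]:
    """Simulated stream where the user presses stop after the first chunk."""
    yield "Hel"
    stop.set()
    yield "lo"
    yield " world"


failed = consume(flaky_stream(), threading.Event())
cancelled = consume(stopped_stream(), stop)
complete = consume(["Hel", "lo"], threading.Event())
print(failed, cancelled, complete, sep="\n")
```

All three outcomes hand back the same shape, so the UI can render the partial text in every case and vary only the controls it shows: nothing extra on completion, a retry option on failure, and a returned input box on cancel.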
The takeaway
Streaming is the cheapest large upgrade you can give an AI feature's feel: the work is the same, but output that starts appearing immediately reads as fast. Read chunks in a loop and always render the accumulated text, never raw deltas, so formatting stays intact. Parse tolerantly because every intermediate state is unfinished, tame layout shift and auto-scroll so reading stays comfortable, and throttle updates for smoothness. Finally, treat cancellation and partial failure as first-class states. Get these right and a slow model feels responsive, which is most of what users actually judge.
