Production-tested UX patterns for AI products — streaming, RAG citations, token-cost transparency, quota fallback, evals, and AI empty states. Each pattern shown with a real production screen, not a mockup.
Maintained by Aleksey Stepikin (Stepikin Studio — solo+AI design studio). Full annotated version with problem/pattern/why breakdowns: stepikin.com/llm-ux-patterns
Most AI products fail at the interface, not the model. The model works — but the screen around it hides cost, hides sources, breaks under rate limits, and leaves new users staring at a blank page. These six patterns address that.
Problem: a spinner for 20 seconds, then a wall of text. The user can't tell if the model is thinking, stuck, or about to be wrong. Trust collapses in the silence.
Pattern: stream the steps, not just tokens — the brief as understood, the plan, each tool call with its actual arguments, each result, in order. Add a live token/latency counter.
Why: watching the work makes AI feel competent instead of magic, and makes failure legible — the user sees where a run went wrong, not just a bad answer.
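A minimal sketch of what "stream the steps" can look like as data, assuming the agent runtime can hand the UI one typed event per step. The names `StepEvent`, `stream_run`, and `render` are hypothetical and not taken from any specific framework. Each event carries the running token and latency counters, so the interface can show them live.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple
import time


@dataclass
class StepEvent:
    kind: str            # e.g. "brief", "plan", "tool_call", "tool_result", "answer"
    payload: dict        # for tool calls: the actual arguments, not a summary
    tokens_so_far: int   # running token total for the live counter
    elapsed_ms: int      # running latency for the live counter


def stream_run(
    steps: Iterable[Tuple[str, dict, int]],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[StepEvent]:
    """Yield one event per step, in order, with cumulative counters.

    `steps` is a hypothetical feed of (kind, payload, tokens_used) tuples
    produced by the agent runtime.
    """
    start = clock()
    total_tokens = 0
    for kind, payload, tokens_used in steps:
        total_tokens += tokens_used
        elapsed_ms = int((clock() - start) * 1000)
        yield StepEvent(kind, payload, total_tokens, elapsed_ms)


def render(event: StepEvent) -> str:
    """One line of the step feed, as a plain-text stand-in for the UI row."""
    return f"[{event.elapsed_ms:>6} ms | {event.tokens_so_far:>5} tok] {event.kind}: {event.payload}"
```

Because every step is its own event, a failed run stops at a visible step with visible arguments, rather than collapsing into one bad answer.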
Problem: RAG systems answer confidently from documents the user can't see. A stale or silently-failed source produces wrong answers that look identical to right ones.
Pattern: make the knowledge layer a first-class screen: what's indexed, chunking, embedding/rerank models, freshness per source — and surface the failed source in red, not in a log.
Why: in domains where being wrong has a cost, the audit trail is the product. Let users disagree with the AI by giving them everything needed to check it.
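One way to back the knowledge screen is a per-source status that puts failures first. This is a sketch under stated assumptions: the `Source` fields, the one-day `max_age` default, and the three status names are illustrative, not part of any real indexing API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Source:
    name: str
    last_indexed: Optional[datetime]          # None if the source was never indexed
    last_error: Optional[str] = None          # set when the most recent sync failed
    max_age: timedelta = timedelta(days=1)    # assumed freshness budget per source


def source_status(src: Source, now: datetime) -> str:
    """Classify a source as "failed", "stale", or "fresh"."""
    if src.last_error is not None:
        return "failed"
    if src.last_indexed is None or now - src.last_indexed > src.max_age:
        return "stale"
    return "fresh"


def knowledge_panel(sources: List[Source], now: datetime) -> List[Tuple[str, str]]:
    """Rows for the knowledge screen, failed sources first, then stale, then fresh."""
    severity = {"failed": 0, "stale": 1, "fresh": 2}
    rows = [(source_status(s, now), s.name) for s in sources]
    return sorted(rows, key=lambda row: (severity[row[0]], row[1]))
```

Sorting by severity is the design choice here: the failed source lands at the top of the screen, where the user sees it before trusting an answer.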
Problem: AI features have real marginal cost per use, hidden until the invoice. Users can't reason about a tool whose cost is invisible.
Pattern: a readable cost surface: month-to-date (MTD) spend, forecast vs. budget, cost per run, and a breakdown by agent / model / tool. The expensive model and the chatty tool become obvious.
Why: cost transparency is what lets a buyer say yes. It turns "AI is unpredictably expensive" into a number they can plan around.
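The numbers on such a cost surface can be derived from per-run cost records. The sketch below assumes each run is logged with a `model` and a `cost_usd` field (hypothetical names) and uses a simple linear forecast, which is one reasonable choice among several.

```python
import calendar
from collections import defaultdict
from datetime import date
from typing import Dict, List


def cost_summary(runs: List[dict], today: date, budget_usd: float) -> Dict[str, object]:
    """Summarize this month's runs: MTD spend, forecast, cost per run, breakdown.

    `runs` holds the current month's records, each with hypothetical
    "model" and "cost_usd" keys.
    """
    mtd = sum(r["cost_usd"] for r in runs)
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Linear extrapolation: assume the rest of the month spends at the same daily rate.
    forecast = mtd / today.day * days_in_month
    by_model: Dict[str, float] = defaultdict(float)
    for r in runs:
        by_model[r["model"]] += r["cost_usd"]
    return {
        "mtd_usd": mtd,
        "forecast_usd": forecast,
        "over_budget": forecast > budget_usd,
        "cost_per_run_usd": mtd / len(runs) if runs else 0.0,
        # Most expensive model first, so it is the first thing the reader sees.
        "by_model": sorted(by_model.items(), key=lambda kv: kv[1], reverse=True),
    }
```

The same grouping works for agent or tool by swapping the key; the point is that the breakdown is sorted so the biggest line item leads.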
Problem: provider rate limits are not an edge case — they're Tuesday. Most products render a 429 as a generic error toast and a dead feature.
Pattern: treat the limit as a first-class state: which provider throttles, what's already happening automatically (fallback routing), and explicit user choices — raise the tier, stay on fallback, throttle non-essential work — each with its trade-off priced out.
Why: graceful degradation separates products that survive a spike from products that just break. Never make the user guess whether the AI is down or busy.
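A sketch of the limit as a first-class state rather than an error toast. It assumes each provider is wrapped in a callable that raises a `RateLimited` exception on a 429; that exception, `RouteResult`, and `complete_with_fallback` are all hypothetical names, not any vendor's SDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class RateLimited(Exception):
    """Raised by a (hypothetical) provider wrapper when it receives a 429."""

    def __init__(self, provider: str, retry_after_s: float):
        super().__init__(f"{provider} is rate limited")
        self.provider = provider
        self.retry_after_s = retry_after_s


@dataclass
class RouteResult:
    text: str
    served_by: str                              # which provider actually answered
    degraded: bool                              # True when fallback routing kicked in
    throttled_provider: Optional[str] = None    # who is throttling, for the UI
    retry_after_s: Optional[float] = None
    choices: Tuple[str, ...] = ()               # explicit options to offer the user


Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, primary: Provider, fallback: Provider) -> RouteResult:
    """Try the primary provider; on a rate limit, answer from the fallback
    and return enough state for the UI to explain what happened."""
    primary_name, primary_call = primary
    fallback_name, fallback_call = fallback
    try:
        return RouteResult(primary_call(prompt), served_by=primary_name, degraded=False)
    except RateLimited as exc:
        return RouteResult(
            fallback_call(prompt),
            served_by=fallback_name,
            degraded=True,
            throttled_provider=exc.provider,
            retry_after_s=exc.retry_after_s,
            choices=("raise the tier", "stay on fallback", "throttle non-essential work"),
        )
```

The caller always gets a result object, never a bare exception, so the screen can say which provider is throttling and what is already being done about it.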
Problem: "is the AI getting better or worse?" — every team is asked, few can answer on screen. Prompt changes ship, quality silently regresses.
Pattern: a visible quality ledger: weighted score over time, a pass threshold, regressions flagged when the score drops below that threshold, judges (model + human) with their disagreement rate, and each regression traced to the change that caused it.
Why: evals on screen turn AI quality from a vibe into a defensible number — for the team, the buyer, and the regulator.
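The ledger's three core numbers can be computed from very little data. A minimal sketch, assuming per-criterion scores in the range 0 to 1, a ledger keyed by change ID, and boolean pass/fail verdicts from each judge; all function names and data shapes are illustrative.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def flag_regressions(ledger: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Return the change IDs at which the score crossed below the threshold.

    `ledger` is a list of (change_id, score) in ship order. Only the change
    that caused the crossing is flagged, not every later entry that stays low.
    """
    flagged = []
    previous = None
    for change_id, score in ledger:
        if score < threshold and (previous is None or previous >= threshold):
            flagged.append(change_id)
        previous = score
    return flagged


def disagreement_rate(model_verdicts: Sequence[bool], human_verdicts: Sequence[bool]) -> float:
    """Share of cases where the model judge and the human judge disagree."""
    pairs = list(zip(model_verdicts, human_verdicts))
    if not pairs:
        return 0.0
    return sum(m != h for m, h in pairs) / len(pairs)
```

Flagging only the crossing point is what lets each regression be traced to a single change instead of a noisy run of red entries.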
Problem: a new user opens an AI product to nothing — no data, no examples, no idea what good looks like. This is where most AI tools lose people.
Pattern: be honest and directive. Don't fake activity, explain what the screen becomes after the first action, offer one unambiguous primary CTA, and suggest the gentlest first step (a starter template, not a blank canvas).
Why: the first run is the highest-stakes screen in the product. A tool that respects the user's intelligence on a quiet day earns the right to a loud one.
All screenshots are from Atlas, an AI-agent control room (observability, traces, evals, billing, multi-tenant) designed by Aleksey Stepikin. More AI product work: Vigilo — global risk monitor, 44 live sources, 198 countries, built solo.
Found a pattern that belongs here, or a production example that does one of these better? Open an issue or PR.
Text: CC BY 4.0 — cite stepikin.com/llm-ux-patterns. Screenshots: © Aleksey Stepikin, used here for documentation; ask before reuse.





