SDE building production AI systems: RAG pipelines, agent workflows, and full-stack products.
Eval-first: I measure my AI systems, then I measure the evals themselves.
Based in Greater Seattle · Turning ambiguous problems into reliable, verified systems, and publishing what the numbers actually say.
AI Reliability Copilot
Turns a production incident (alerts, logs, on-call notes) into a structured 9-section reliability report covering severity, ranked root-cause hypotheses, mitigation with rollback, and postmortem. Ships as a web app, an MCP server, and a CLI.
But the product isn't the point; the eval pipeline is. Every prompt change is scored by an LLM-as-judge across a fixed scenario suite, with repeats and error bars. I treat my own AI like a system under test, and I publish what the numbers say even when it's unflattering:
Found that prompt v1 / v2 / v3 differences were statistical noise, not improvement: the within-cell std was larger than every between-version delta. Caught myself before shipping a "+0.16 quality win" that was sampling noise.
Cross-checked the judge against an independent model (DeepSeek-judges-DeepSeek vs. Claude Sonnet 4.6). The same-family judge inflates overall scores by +0.24 / 5 (~5%), concentrated in the soft dimensions (actionability, completeness) and zero on safety (90% exact agreement). I had guessed 10-20%; measuring showed I'd overestimated.
Next.js 16 · TypeScript · Vercel AI SDK · DeepSeek · Supabase / pgvector · MCP · Cross-model eval
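The noise check and the judge cross-check described above can be sketched in a few lines. This is a simplified illustration in Python, not the project's actual pipeline: the function names, the pooling of scores per prompt version (rather than per scenario cell), and the sample scores are all assumptions made up for the example.

```python
from statistics import mean, stdev

def summarize(scores_by_version):
    """Return {version: (mean, sample std)} for repeated judge scores."""
    return {v: (mean(s), stdev(s)) for v, s in scores_by_version.items()}

def deltas_within_noise(scores_by_version):
    """True when the largest gap between version means is smaller than the
    smallest within-version std, i.e. no delta stands out from the noise."""
    stats = summarize(scores_by_version)
    means = [m for m, _ in stats.values()]
    smallest_std = min(sd for _, sd in stats.values())
    largest_delta = max(means) - min(means)
    return largest_delta < smallest_std

def judge_bias(same_family, independent):
    """Mean paired difference: same-family judge minus independent judge."""
    return mean(a - b for a, b in zip(same_family, independent))

# Made-up scores on a 1-5 scale, four repeats per prompt version.
SAMPLE_SCORES = {
    "v1": [3.9, 4.3, 4.1, 3.7],
    "v2": [4.2, 3.8, 4.4, 4.0],
    "v3": [4.0, 4.4, 3.6, 4.2],
}

if __name__ == "__main__":
    print(summarize(SAMPLE_SCORES))
    print("differences are noise:", deltas_within_noise(SAMPLE_SCORES))
```

Comparing deltas against the std is a quick screen rather than a significance test; with more repeats, a confidence interval on each delta would be a stricter version of the same idea.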
mcp-recall: Local-first structured memory for Claude Code over MCP, with hybrid retrieval (vector KNN + BM25 fused with Reciprocal Rank Fusion), local embeddings, recency-decay ranking, and a dedup guard. No API keys; nothing leaves the machine.
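For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal Python sketch of fusing a vector ranking with a BM25 ranking and then applying a recency decay. It is not mcp-recall's code: `k=60` is the constant commonly used for RRF, while the exponential half-life decay, the function names, and the sample ids are assumptions for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

def apply_recency_decay(scores, age_days, half_life_days=30.0):
    """Halve a doc's fused score for every half_life_days of age (assumed form).
    Docs missing from age_days are treated as brand new."""
    return {
        doc_id: score * 0.5 ** (age_days.get(doc_id, 0.0) / half_life_days)
        for doc_id, score in scores.items()
    }

def ranked_ids(scores):
    """Doc ids sorted by descending score, ties broken by id for determinism."""
    return sorted(scores, key=lambda d: (-scores[d], d))

if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical KNN result order
    bm25_hits = ["b", "c", "d"]     # hypothetical BM25 result order
    fused = rrf_fuse([vector_hits, bm25_hits])
    print(ranked_ids(fused))
    print(ranked_ids(apply_recency_decay(fused, {"b": 300.0})))
```

RRF only needs ranks, not raw scores, so the vector and BM25 scores never have to be put on a common scale before fusing.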
A practical AI application engineering roadmap (LLM basics → Prompt, RAG, Agent, MCP, evaluation, production) with hands-on tutorials, checklists, templates, and real RAG bad cases.
SRE Investigator: MCP-agnostic SRE skill for Claude Code that discovers whatever MCP tools are exposed, classifies them by capability, queries read-only evidence, and keeps evidence separate from inference.
ServiceAtlas: AI SRE knowledge compiler that turns a codebase into source-grounded runbooks, dependency / blast-radius maps, observability-gap reports, and PR reliability-impact analysis.
Self-hosted via lowlighter/metrics, refreshed daily by GitHub Actions and served straight from this repo (no rate-limited third-party instance).
- RAG quality is mostly retrieval design
- Agent systems need evals before they need more tools
- Model routing is a product decision, not just an optimization
- Capacity planning for AI products starts with traffic shape
- Building eval-first AI tooling: repeatable scenarios > manual inspection, and validating the judges themselves (cross-model bias measurement)
- Maintaining Awesome AI Application Engineer, a practical roadmap for Chinese developers learning production LLM apps
- Exploring MCP as a substrate for SRE / on-call workflows
- Open to chat about: production LLM systems, RAG quality, agent evals, MCP, AI infra
Reach me at yanpengqi.com · LinkedIn
"AI systems should be grounded in real data, observable when the model is wrong, and evaluated with repeatable scenarios."
