Yanpeng Qi YanpengQi7

Hi, I'm Yanpeng 👋

SDE building production AI systems — RAG pipelines, agent workflows, and full-stack products.
Eval-first: I measure my AI systems, then I measure the evals themselves.
Based in Greater Seattle · Turning ambiguous problems into reliable, verified systems, and publishing what the numbers actually say.

🛠️ Tech I reach for

🚀 Featured Project

🛡️ AI Reliability Copilot

Turns a production incident — alerts, logs, on-call notes — into a structured 9-section reliability report (severity, ranked root-cause hypotheses, mitigation with rollback, postmortem). Ships as a web app, an MCP server, and a CLI.

But the product isn't the point; the eval pipeline is. Every prompt change is scored by an LLM-as-judge across a fixed scenario suite, with repeats and error bars. I treat my own AI like a system under test, and I publish what the numbers say even when they're unflattering:

📉 Found that prompt v1 / v2 / v3 differences were statistical noise, not improvement — the within-cell std was larger than every between-version delta. Caught myself before shipping a "+0.16 quality win" that was sampling noise.
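The noise check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name `delta_vs_noise`, the choice of one "cell" per prompt version, and the sample scores are all assumptions made for the example. It compares the largest between-version mean delta against the smallest within-cell sample standard deviation, which is the conservative reading of "std larger than every delta".

```python
from statistics import mean, stdev


def delta_vs_noise(scores_by_version):
    """Compare between-version mean deltas against within-cell spread.

    scores_by_version maps a version label to the repeated judge scores
    collected for that version (hypothetical layout: one cell per version).
    """
    means = {v: mean(s) for v, s in scores_by_version.items()}
    # Conservative noise floor: the smallest within-cell sample std.
    noise_floor = min(stdev(s) for s in scores_by_version.values())
    versions = list(scores_by_version)
    # Signed delta for every pair of versions (later minus earlier).
    deltas = {
        (a, b): means[b] - means[a]
        for i, a in enumerate(versions)
        for b in versions[i + 1:]
    }
    max_delta = max(abs(d) for d in deltas.values())
    return {
        "means": means,
        "noise_floor": noise_floor,
        "max_delta": max_delta,
        # True when no version difference rises above the noise floor.
        "is_noise": max_delta < noise_floor,
    }


# Made-up scores on a 1-5 scale, four repeats per prompt version.
example = {
    "v1": [3.8, 4.2, 3.6, 4.4],
    "v2": [4.0, 4.3, 3.7, 4.2],
    "v3": [4.1, 3.7, 4.4, 4.0],
}
result = delta_vs_noise(example)
print(result["is_noise"], round(result["max_delta"], 3), round(result["noise_floor"], 3))
```

A stricter version would run a proper significance test or bootstrap confidence intervals per scenario; the std-versus-delta comparison is only a quick first filter.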

🔬 Cross-checked the judge against an independent model (DeepSeek-judges-DeepSeek vs. Claude Sonnet 4.6). The same-family judge inflates overall scores by +0.24 / 5 (~5%) — concentrated in the soft dimensions (actionability, completeness), zero on safety (90% exact agreement). I had guessed 10–20%; measuring showed I'd overestimated.
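One way to compute the two numbers quoted above (mean score inflation and exact agreement) is sketched below. It assumes both judges scored the same reports on the same integer 1-5 rubric; the function name `judge_bias`, the dictionary layout, and the sample scores are hypothetical and not taken from the project.

```python
from statistics import mean


def judge_bias(same_family, independent):
    """Per-dimension mean score gap and exact-agreement rate between two judges.

    Both arguments map a rubric dimension to a list of scores, aligned so that
    index i in each list refers to the same report (assumed layout).
    """
    report = {}
    for dim, a in same_family.items():
        b = independent[dim]
        if len(a) != len(b):
            raise ValueError(f"score lists for {dim!r} differ in length")
        pairs = list(zip(a, b))
        report[dim] = {
            # Positive gap means the same-family judge scores higher.
            "gap": mean(x - y for x, y in pairs),
            # Share of reports where both judges gave the identical score.
            "exact_agreement": sum(x == y for x, y in pairs) / len(pairs),
        }
    return report


# Made-up scores for four reports on two rubric dimensions.
same_family_scores = {"safety": [5, 5, 4, 5], "actionability": [5, 4, 5, 4]}
independent_scores = {"safety": [5, 5, 4, 5], "actionability": [4, 4, 5, 3]}
bias = judge_bias(same_family_scores, independent_scores)
print(bias)
```

Reporting the gap per dimension rather than as a single overall number is what makes it possible to see inflation concentrated in some dimensions and absent in others.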

Next.js 16 · TypeScript · Vercel AI SDK · DeepSeek · Supabase / pgvector · MCP · Cross-model eval

🧰 More Projects

🧠 mcp-recall

Local-first structured memory for Claude Code over MCP — hybrid retrieval (vector KNN + BM25 fused with Reciprocal Rank Fusion), local embeddings, recency-decay ranking, dedup guard. No API keys; nothing leaves the machine.
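For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the general technique: each document scores the sum of `1 / (k + rank)` over the ranked lists it appears in. The recency step is an assumption for illustration (an exponential half-life multiplier); the source does not say which decay function mcp-recall uses. All function names, the `k=60` default, and the 30-day half-life are hypothetical choices, not the project's actual API or settings.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def apply_recency_decay(scores, age_days, half_life_days=30.0):
    """Scale each score by 0.5 ** (age / half_life); an assumed decay form."""
    return {
        doc_id: score * 0.5 ** (age_days[doc_id] / half_life_days)
        for doc_id, score in scores.items()
    }


def ranked_ids(scores):
    """Document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result lists from a vector KNN search and a BM25 search.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(ranked_ids(fused))  # ['b', 'a', 'd', 'c']

# The same fused scores after decay, with "b" being 60 days old.
ages = {"a": 0, "b": 60, "c": 0, "d": 0}
decayed = apply_recency_decay(fused, ages)
print(ranked_ids(decayed))
```

RRF works on ranks only, so the vector distances and BM25 scores never need to be normalized onto a common scale, which is the usual reason for choosing it in hybrid retrieval.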

MCP · sqlite-vec · Hybrid Retrieval · Local Embeddings

📚 Awesome AI Application Engineer

A practical AI application engineering roadmap — LLM basics → Prompt, RAG, Agent, MCP, evaluation, production — with hands-on tutorials, checklists, templates, and real RAG bad cases.

LLM · RAG · Agent · MCP · Evaluation · Production

🔍 SRE Investigator

MCP-agnostic SRE skill for Claude Code — discovers whatever MCP tools are exposed, classifies them by capability, queries read-only evidence, and keeps evidence separate from inference.

Claude Code Skill · MCP · Prompt design

🗺️ ServiceAtlas

AI SRE knowledge compiler — turns a codebase into source-grounded runbooks, dependency / blast-radius maps, observability-gap reports, and PR reliability-impact analysis.

Python · LLM · Reliability · Runbooks

📊 GitHub Stats

Self-hosted via lowlighter/metrics: refreshed daily by GitHub Actions, served straight from this repo (no rate-limited third-party instance).

✍️ Recent Writing

🌱 Now

Building eval-first AI tooling — repeatable scenarios > manual inspection, and validating the judges themselves (cross-model bias measurement)
Maintaining Awesome AI Application Engineer — a practical roadmap for Chinese developers learning production LLM apps
Exploring MCP as a substrate for SRE / on-call workflows
Open to chat about: production LLM systems, RAG quality, agent evals, MCP, AI infra

📬 Reach me at yanpengqi.com · LinkedIn

"AI systems should be grounded in real data, observable when the model is wrong, and evaluated with repeatable scenarios."
