YanpengQi7/README.md

Hi, I'm Yanpeng 👋

SDE building production AI systems: RAG pipelines, agent workflows, and full-stack products.
Eval-first: I measure my AI systems, then I measure the evals themselves.
Based in Greater Seattle · Turning ambiguous problems into reliable, verified systems, and publishing what the numbers actually say.

Portfolio · Blog · LinkedIn · AI Engineer Roadmap · CV


🛠️ Tech I reach for


🚀 Featured Project

🛡️ AI Reliability Copilot · Demo

Turns a production incident (alerts, logs, on-call notes) into a structured 9-section reliability report (severity, ranked root-cause hypotheses, mitigation with rollback, postmortem). Ships as a web app, an MCP server, and a CLI.

But the product isn't the point; the eval pipeline is. Every prompt change is scored by an LLM-as-judge across a fixed scenario suite, with repeats and error bars. I treat my own AI like a system under test, and I publish what the numbers say even when it's unflattering:

📉 Found that prompt v1 / v2 / v3 differences were statistical noise, not improvement: the within-cell std was larger than every between-version delta. Caught myself before shipping a "+0.16 quality win" that was sampling noise.

🔬 Cross-checked the judge against an independent model (DeepSeek-judges-DeepSeek vs. Claude Sonnet 4.6). The same-family judge inflates overall scores by +0.24 / 5 (~5%), concentrated in the soft dimensions (actionability, completeness), with zero inflation on safety (90% exact agreement). I had guessed 10–20%; measuring showed I'd overestimated.

Next.js 16 · TypeScript · Vercel AI SDK · DeepSeek · Supabase / pgvector · MCP · Cross-model eval
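
The two eval findings above (version deltas that turn out to be noise, and a same-family judge that scores higher than an independent one) can be sketched in a few lines. This is an illustrative sketch, not the project's actual pipeline: the function names `noise_check` and `judge_bias`, the "delta must exceed the smallest within-version std" rule, and all sample scores are assumptions made up for the example.

```python
from statistics import mean, stdev


def noise_check(scores_by_version):
    """Compare between-version deltas with within-version spread.

    scores_by_version maps a prompt version to its repeated judge scores
    for the same scenario cell (hypothetical input shape).
    """
    means = {v: mean(s) for v, s in scores_by_version.items()}
    # Sample standard deviation of the repeats, per version.
    stds = {v: stdev(s) for v, s in scores_by_version.items()}
    # Largest gap between any two version means.
    max_delta = max(means.values()) - min(means.values())
    # Assumed rule: if even the largest delta is below the smallest
    # within-version std, treat the difference as sampling noise.
    return {
        "means": means,
        "stds": stds,
        "max_delta": max_delta,
        "is_noise": max_delta < min(stds.values()),
    }


def judge_bias(same_family, independent):
    """Paired scores for the same outputs from two different judges."""
    diffs = [a - b for a, b in zip(same_family, independent)]
    return {
        # Positive value: the same-family judge scores higher on average.
        "mean_inflation": mean(diffs),
        # Share of items where both judges gave the identical score.
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
    }


if __name__ == "__main__":
    # Made-up scores on a 1-5 scale, three repeats per prompt version.
    print(noise_check({
        "v1": [3.9, 4.3, 4.1],
        "v2": [4.2, 4.0, 4.4],
        "v3": [4.1, 4.5, 4.0],
    }))
    print(judge_bias([4, 5, 3, 4], [4, 4, 3, 4]))
```

A stricter version would use a standard error or a significance test rather than a raw std comparison; the point here is only that a delta is reported next to the spread it has to beat.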


🧰 More Projects

🧠 mcp-recall

Local-first structured memory for Claude Code over MCP: hybrid retrieval (vector KNN + BM25 fused with Reciprocal Rank Fusion), local embeddings, recency-decay ranking, dedup guard. No API keys; nothing leaves the machine.

MCP · sqlite-vec · Hybrid Retrieval · Local Embeddings
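
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of how two ranked result lists (one from vector KNN, one from BM25) can be merged. It is not taken from mcp-recall: the function name, the memory ids, and the constant `k=60` (a commonly used default for RRF) are assumptions, and recency decay and dedup are left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ids missing from a list add nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    vector_hits = ["m3", "m1", "m2"]  # hypothetical KNN order
    bm25_hits = ["m1", "m4", "m3"]    # hypothetical BM25 order
    print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
    # prints ['m1', 'm3', 'm4', 'm2']
```

RRF only needs ranks, not raw scores, which is why it suits fusing cosine distances with BM25 scores that live on different scales.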

Awesome AI Application Engineer

A practical AI application engineering roadmap (LLM basics → Prompt, RAG, Agent, MCP, evaluation, production) with hands-on tutorials, checklists, templates, and real RAG bad cases.

LLM · RAG · Agent · MCP · Evaluation · Production

sre-investigator

MCP-agnostic SRE skill for Claude Code: discovers whatever MCP tools are exposed, classifies them by capability, queries read-only evidence, and keeps evidence separate from inference.

Claude Code Skill · MCP · Prompt design

🗺️ ServiceAtlas

AI SRE knowledge compiler: turns a codebase into source-grounded runbooks, dependency / blast-radius maps, observability-gap reports, and PR reliability-impact analysis.

Python · LLM · Reliability · Runbooks


📊 GitHub Stats

GitHub metrics

Self-hosted via lowlighter/metrics, refreshed daily by GitHub Actions, served straight from this repo (no rate-limited third-party instance).


✍️ Recent Writing


🌱 Now

  • Building eval-first AI tooling: repeatable scenarios > manual inspection, and validating the judges themselves (cross-model bias measurement)
  • Maintaining Awesome AI Application Engineer, a practical roadmap for Chinese developers learning production LLM apps
  • Exploring MCP as a substrate for SRE / on-call workflows
  • Open to chat about: production LLM systems, RAG quality, agent evals, MCP, AI infra

📬 Reach me at yanpengqi.com · LinkedIn

"AI systems should be grounded in real data, observable when the model is wrong, and evaluated with repeatable scenarios."

Popular repositories

  1. ai-reliability-copilot Public

    Turn a production incident into a structured 9-section LLM response (severity, root cause, mitigation, postmortem). Ships with a 5-scenario regression suite + LLM-as-judge eval pipeline.

    TypeScript 28

  2. study-abroad-platform Public archive

    Study-abroad school-matching demo: auth, academic profile intake, and a reach/match/safety recommendation algorithm. React + Node + Supabase.

    TypeScript 1

  3. portfolio-site Public

    Personal portfolio with AI chatbot: Next.js 16, Gemini Flash, RAG, multi-agent

    TypeScript 1

  4. sre-investigator Public

    MCP-agnostic SRE investigation skill for Claude Code: discovers tools, queries read-only evidence, keeps evidence separate from inference.

    1

  5. YanpengQi7 Public

    GitHub profile README for Yanpeng Qi

    1

  6. mcp-recall Public

    Local-first structured memory for Claude Code over MCP: semantic recall of incidents, code reading, decisions, and lessons. SQLite + sqlite-vec, local embeddings, no API keys.

    TypeScript 1