Event-driven AI customer-support copilot. Per message: triage → RAG retrieve → grounded draft → policy/QA guard → escalate-or-suggest. Runs 100% locally at $0.
A customer message arrives → the Go API persists it and publishes a message.created event → a Python LangGraph worker consumes it, classifies intent, retrieves grounded knowledge (pgvector), drafts a cited reply, runs a safety/QA guard, and either suggests a draft to a human agent or escalates when confidence is low. Results stream back to a Next.js console via GraphQL subscriptions.
Grounding is mandatory, humans stay in the loop, and an eval harness gates quality in CI so nothing untrustworthy reaches a customer.
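As a rough illustration of what gating quality in CI can mean, here is a minimal sketch of a pass/fail check. The thresholds are the ones this README states for the eval gate (groundedness ≥90%, routing ≥85%, zero forbidden actions); the `EvalSummary` type and the function names are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass

# Thresholds quoted from this README's eval gate.
MIN_GROUNDEDNESS = 0.90
MIN_ROUTING_ACCURACY = 0.85
MAX_FORBIDDEN_ACTIONS = 0


@dataclass(frozen=True)
class EvalSummary:
    """Hypothetical aggregate of one eval run over the golden set."""
    groundedness: float      # fraction of drafts judged grounded
    routing_accuracy: float  # fraction of messages routed to the right category
    forbidden_actions: int   # count of drafts that attempted a forbidden action


def gate_failures(summary: EvalSummary) -> list:
    """Return one human-readable reason per violated threshold."""
    failures = []
    if summary.groundedness < MIN_GROUNDEDNESS:
        failures.append(f"groundedness {summary.groundedness:.2f} < {MIN_GROUNDEDNESS:.2f}")
    if summary.routing_accuracy < MIN_ROUTING_ACCURACY:
        failures.append(f"routing {summary.routing_accuracy:.2f} < {MIN_ROUTING_ACCURACY:.2f}")
    if summary.forbidden_actions > MAX_FORBIDDEN_ACTIONS:
        failures.append(f"{summary.forbidden_actions} forbidden action(s)")
    return failures


def exit_code(summary: EvalSummary) -> int:
    """Non-zero when any threshold is missed, so a CI step would fail."""
    return 1 if gate_failures(summary) else 0
```

A CI job could call `sys.exit(exit_code(summary))` after the run; how the real harness aggregates its metrics is not shown here.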
flowchart LR
UI["Next.js console"]
R["Go API · gqlgen<br/>(thin resolvers)"]
DB[("Postgres<br/>+ pgvector")]
BUS{{"Redis Streams<br/>(event bus)"}}
LLM[("LLM<br/>Ollama / OpenAI")]
TOOLS["MCP tools<br/>(read-only)"]
subgraph WK["Python LangGraph worker"]
direction LR
T["triage"] --> RT["retrieve"] --> D["draft"] --> G["guard"] --> DEC{"decision"}
DEC -->|repair ×1| D
end
UI -- "mutation" --> R
R -- "persist" --> DB
R == "message.created" ==> BUS
BUS == "consume" ==> T
WK == "draft.ready / escalated" ==> BUS
BUS -- "bridge" --> R
R -. "subscription (live)" .-> UI
RT -. "vector + FTS" .-> DB
D -. "grounded gen" .-> LLM
D -. "order / policy" .-> TOOLS
- Go API (services/api): thin; it validates, persists, and publishes. No LLM/agent logic.
- Python worker (workers/agent): LangGraph state machine; nodes are pure-ish and schema-validated.
- Contract: API ↔ worker talk only via the typed event schema (packages/events/events.schema.json) on Redis Streams. The request's trace id rides on the event, so one OpenTelemetry trace spans API → worker → LLM.
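To make the contract idea concrete, here is a minimal sketch of a typed `message.created` envelope that carries a trace id and is validated on both sides of the bus. The field names (`event_id`, `conversation_id`, `message_id`, `trace_id`) and the single `payload` stream field are assumptions for illustration; the authoritative shape is whatever events.schema.json defines.

```python
import json
import re
from dataclasses import asdict, dataclass

# OpenTelemetry trace ids are 16 bytes, rendered as 32 lowercase hex characters.
TRACE_ID_RE = re.compile(r"^[0-9a-f]{32}$")


@dataclass(frozen=True)
class MessageCreated:
    """Hypothetical envelope; the real fields live in events.schema.json."""
    event_id: str
    conversation_id: str
    message_id: str
    trace_id: str
    type: str = "message.created"

    def __post_init__(self):
        # Validate on construction so producer and consumer reject bad events alike.
        if self.type != "message.created":
            raise ValueError(f"unexpected event type: {self.type!r}")
        if not TRACE_ID_RE.match(self.trace_id):
            raise ValueError("trace_id must be 32 lowercase hex characters")
        for name in ("event_id", "conversation_id", "message_id"):
            if not getattr(self, name):
                raise ValueError(f"{name} must be non-empty")


def encode(event: MessageCreated) -> dict:
    """Redis Streams entries are flat field/value maps, so ship one JSON field."""
    return {"payload": json.dumps(asdict(event), sort_keys=True)}


def decode(fields: dict) -> MessageCreated:
    """Rebuild (and thereby re-validate) the event on the consumer side."""
    return MessageCreated(**json.loads(fields["payload"]))
```

Because `decode` reconstructs the dataclass, a malformed event fails at the boundary rather than deep inside a graph node.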
Go + gqlgen · Python + LangGraph · Postgres + pgvector · Redis Streams · MCP-style tools · Next.js + Tailwind + shadcn/ui · Ollama (default) / OpenAI-compatible · OpenTelemetry · GitHub Actions + Docker.
Prerequisites: Docker. (Local dev also: Go 1.25+, Python 3.12+, Node 22+.)
cd resolver_code
cp .env.example .env # defaults run fully local / $0 — no keys required
make up # build + boot full stack (postgres+pgvector, redis, ollama, migrate, api, worker, web)
make models # pull qwen2.5:3b, qwen2.5:7b, nomic-embed-text (first run only)
make ingest # Bitext -> KB + embeddings + held-out golden set
make eval # run the eval harness -> report + eval_runs row
- Console: http://localhost:3000 (queue, conversation view, live drafts, dashboard).
- GraphQL playground / health: http://localhost:8080 · /healthz.
- Traces (optional): docker compose --profile observability up jaeger, set OTEL_TRACES_EXPORTER=otlp, open http://localhost:16686.
On a CPU-only box, drafting with the local 3b/7b models is slow (minutes per draft). Point LLM_PROVIDER=openai at a hosted/compatible endpoint for fast responses; see Models & providers.
The gqlgen-generated Go files are not committed; the Docker build and the make gqlgen target regenerate them from packages/graphql/schema.graphql.
resolver_code/
├── apps/web Next.js agent console (streaming) [Phase 4]
├── services/api Go + gqlgen GraphQL API [Phase 1]
├── workers/agent Python LangGraph graph, rag/, tools/ [Phase 3]
│ ├── graph/nodes triage · retrieve · draft · guard · decision · repair
│ ├── rag/ embeddings, hybrid (vector + FTS) search, RRF + re-rank
│ ├── tools/ read-only MCP tools (audited, allow-listed)
│ └── llm/ provider adapters (ollama / OpenAI-compatible)
├── pipeline/ ingest_bitext.py + eval/ [Phase 2/5]
├── packages/graphql shared schema + codegen TS types
├── packages/events events.schema.json (event contract)
├── db/migrations versioned SQL migrations
├── deploy/ docker-compose.yml
└── data/ golden.jsonl, samples (large files gitignored)
The LLM provider is an env switch behind one interface — no code change to swap.
- Default (local, $0): Ollama. Model tiering reflects cost/quality: qwen2.5:3b for triage/classification, qwen2.5:7b for drafting and the eval judge, nomic-embed-text (768-dim) for embeddings. Pull them with make models.
- Hosted: set LLM_PROVIDER=openai and OPENAI_API_KEY (optionally OPENAI_BASE_URL for any OpenAI-compatible endpoint). The worker's chat + embeddings switch with no code change; a missing key fails loudly at startup.
Tradeoffs: local 3b/7b on CPU is slow (minutes per draft) but free and private; grounding/guard are deterministic so safety holds regardless of model strength. A hosted model raises answer quality and speed at a per-token cost (tracked per draft as cost_cents). Generation length is bounded by DRAFT_NUM_PREDICT to cap latency/cost.
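The env switch described above could look roughly like the sketch below: one loader picks the provider from `LLM_PROVIDER` and raises at startup when the hosted key is missing. `ProviderConfig`, `load_provider_config`, and the `OLLAMA_BASE_URL` variable are hypothetical names; only `LLM_PROVIDER`, `OPENAI_API_KEY`, and `OPENAI_BASE_URL` come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    """Hypothetical resolved settings handed to a provider adapter."""
    provider: str
    base_url: str
    api_key: Optional[str]


def load_provider_config(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Resolve the provider from environment variables, failing loudly on bad input."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "ollama").strip().lower()

    if provider == "ollama":
        # 11434 is Ollama's default port; OLLAMA_BASE_URL is an assumed variable name.
        return ProviderConfig("ollama", env.get("OLLAMA_BASE_URL", "http://localhost:11434"), None)

    if provider == "openai":
        api_key = env.get("OPENAI_API_KEY")
        if not api_key:
            # Fail at startup rather than on the first customer message.
            raise RuntimeError("LLM_PROVIDER=openai requires OPENAI_API_KEY")
        base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
        return ProviderConfig("openai", base_url, api_key)

    raise RuntimeError(f"unknown LLM_PROVIDER: {provider!r}")
```

Passing the environment as a mapping keeps the loader testable without mutating `os.environ`.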
OpenTelemetry traces span the whole path: the API starts a trace per request and
stamps its trace id into the message.created event, so the worker continues
the same trace across the bus (API → worker → graph/LLM). Per-draft tokens, cost,
and latency are recorded on the draft and as span attributes; structured JSON logs
carry conversation/trace ids (no secrets/PII at info).
Exporter is env-controlled (OTEL_TRACES_EXPORTER): console (default — spans in
logs, $0, no extra service), otlp (ships to OTEL_EXPORTER_OTLP_ENDPOINT), or
none. For a trace UI: docker compose --profile observability up jaeger, set
OTEL_TRACES_EXPORTER=otlp, and open Jaeger at localhost:16686.
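The mechanics of continuing a trace across the bus can be sketched without any tracing library. The example below uses the W3C `traceparent` layout (`00-<trace id>-<span id>-<flags>`) as the carrier; that choice, the `traceparent` field name, and all function names are assumptions, and a real worker would more likely rely on the OpenTelemetry SDK's propagators than on hand-rolled parsing.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """API side: start a trace for the incoming request (sampled flag set)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"


def stamp_event(event: dict, traceparent: str) -> dict:
    """API side: copy the trace context onto the outgoing event."""
    return {**event, "traceparent": traceparent}


def continue_trace(event: dict) -> dict:
    """Worker side: reuse the trace id and open a child span under the API span."""
    match = _TRACEPARENT_RE.match(event.get("traceparent", ""))
    if match is None:
        raise ValueError("event carries no valid traceparent")
    trace_id, parent_span_id, _flags = match.groups()
    return {
        "trace_id": trace_id,              # unchanged, so both services share one trace
        "parent_span_id": parent_span_id,  # links the worker span to the API span
        "span_id": secrets.token_hex(8),   # fresh id for the worker's own span
    }
```

The key property is that `trace_id` is never regenerated on the worker, which is what lets a trace UI such as Jaeger show API and worker spans as one trace.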
The design choices, and what they demonstrate:
- "LLM proposes, evals + guards dispose." Every generated answer must cite retrieved KB sources; a deterministic guard (grounding + tone + a forbidden-action allow-list) and a confidence threshold decide suggest vs escalate. An eval harness gates groundedness/routing/safety in CI. Quality is enforced by code, not vibes.
- The LangGraph state machine is the source of truth for control flow (triage → retrieve → draft → guard → decision → {finalize | repair | escalate}). Nodes are pure-ish and schema-validated, so each is unit-testable and the whole graph is inspectable.
- Event-driven Go ↔ Python contract. The thin Go API never calls an LLM; it validates, persists, and publishes a typed event. All AI work lives in the Python worker. They communicate only through the versioned event schema on Redis Streams, so each side is independently deployable and independently scalable.
- Human-in-the-loop safety by construction. Nothing auto-sends below the confidence threshold; irreversible actions (refunds, cancellations) are never executed — only proposed as a human task. Tools are read-only, allow-listed, and audited.
- Cost/model tiering and local-first. Small model for triage, stronger for drafting; embeddings cached; per-draft tokens/cost recorded. Runs 100% locally at $0 on Ollama, or switches to a hosted provider with one env var.
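The guard-then-decide step in the choices above can be sketched as a pure function. This is an illustration under stated assumptions: the 0.7 threshold, the action names, and the `Draft` fields are hypothetical, and the tone check is omitted. Only the three outcomes (finalize, repair, escalate) and the single repair attempt come from this README.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # assumed value; the README does not state the real one
MAX_REPAIRS = 1             # the architecture diagram shows one repair pass
FORBIDDEN_ACTIONS = frozenset({"issue_refund", "cancel_order"})  # hypothetical names


@dataclass(frozen=True)
class Draft:
    """Hypothetical draft shape produced by the draft node."""
    citations: tuple = ()         # ids of KB sources the draft cites
    confidence: float = 0.0
    proposed_actions: tuple = ()  # actions the draft would like to take


def guard(draft: Draft, retrieved_ids) -> dict:
    """Deterministic checks: cited sources must exist and be among the retrieved ones."""
    grounded = bool(draft.citations) and set(draft.citations) <= set(retrieved_ids)
    forbidden = sorted(set(draft.proposed_actions) & FORBIDDEN_ACTIONS)
    return {"grounded": grounded, "forbidden": forbidden}


def decide(draft: Draft, retrieved_ids, repairs_used: int = 0) -> str:
    """Map a guard report and confidence to finalize, repair, or escalate."""
    report = guard(draft, retrieved_ids)
    if report["forbidden"]:
        return "escalate"  # irreversible actions always go to a human
    if not report["grounded"]:
        return "repair" if repairs_used < MAX_REPAIRS else "escalate"
    if draft.confidence < CONFIDENCE_THRESHOLD:
        return "escalate"
    return "finalize"
```

Because `decide` has no I/O and no model call, its behaviour does not depend on which LLM produced the draft, which is the point of keeping the guard deterministic.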
Built phase-by-phase:
- Phase 0 (foundation & local infra): monorepo skeleton, docker-compose stack, DB migrations (pgvector + HNSW), typed event contract. ✅
- Phase 1 (Go GraphQL API): schema-first gqlgen API (thin resolvers → service → pgx store), Redis Streams pubsub bridge, ingestMessage persists + publishes message.created, draft subscription wiring, graceful shutdown, containerized via docker compose up. ✅
- Phase 2 (Dataset → KB & RAG ingestion): make ingest loads Bitext, holds out a stratified golden set (data/golden.jsonl), builds deduped KB docs, embeds them with provider-agnostic embeddings (Ollama nomic-embed-text, 768-dim), and upserts to pgvector with an HNSW index. ✅
- Phase 3 (LangGraph worker): consumes message.created (Redis consumer group, idempotent by event id, retries + dead-letter), runs the agent graph triage → retrieve → draft → guard → decision → {finalize | repair | escalate} with schema-validated node outputs, persists a grounded SUGGESTED draft (or ESCALATED) with citations + guard report + token cost, and publishes draft.ready / draft.escalated. Forbidden actions are blocked deterministically and never finalized. ✅
- Phase 4 (Web agent console): Next.js (App Router) + Tailwind + shadcn/ui console with a typed urql GraphQL client (codegen off the shared schema). Queue with status filter and pagination, conversation view (message thread + full draft panel: confidence meter, grounding sources, guard report), live draftUpdates streaming over graphql-ws, and human-in-the-loop actions (approve/edit → SENT, reject, escalate). Verified live end-to-end: ingest → triage/draft streams in → approve. ✅
- Phase 5 (Eval harness & CI gate): make eval runs the real agent graph over the held-out golden set and scores routing (category), retrieval recall@k, groundedness, LLM-judge answer quality, safety (zero forbidden actions), and cost/latency, writing pipeline/eval/reports/REPORT.md and an eval_runs row. Gated on the PRD §4 numbers (groundedness ≥90%, routing ≥85%, safety 0); the run exits non-zero otherwise. GitHub Actions CI runs Go/Python/web tests, builds all images, and runs the sampled eval gate. ✅
- Phase 6 (P1 enhancements): hybrid retrieval (pgvector + Postgres FTS → Reciprocal Rank Fusion → lexical rerank); read-only, allow-listed, audited MCP tools wired into drafting; priority queue ordering (urgency/sentiment, composite-cursor pagination); a quality dashboard (auto-draft/escalation rates, cost, p95, eval trend); hosted/local LLM provider switch via env; and OpenTelemetry tracing end-to-end (the API trace id propagates onto the event so the worker continues the same trace). ✅
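The Reciprocal Rank Fusion step named in Phase 6 merges the vector and full-text rankings by summing `1 / (k + rank)` per document. Below is a minimal sketch of that formula; `k = 60` is the constant commonly used for RRF and is an assumption here, as are the function name and the document ids. The lexical rerank that follows fusion in the pipeline is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60) -> list:
    """Fuse several best-first rankings of document ids into one ranking.

    Each ranking is assumed to list every id at most once.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked highly by several retrievers accumulates more score.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # hypothetical pgvector result order
fts_hits = ["b", "d", "a"]     # hypothetical Postgres FTS result order
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
# fused == ["b", "a", "d", "c"]: "b" and "a" appear in both lists, so they lead.
```

RRF only needs ranks, not raw scores, which is convenient when cosine distances and FTS relevance scores are not on comparable scales.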
MIT — see LICENSE.