diff --git a/.gitignore b/.gitignore index d5f8eb5..f5ffdee 100644 --- a/.gitignore +++ b/.gitignore @@ -5,7 +5,7 @@ dbt_packages/ logs/ # Python virtual environment -venv/ +.venv/ .dbt/ # Local secrets — copy from .env.example and fill in diff --git a/demo/plan_pr2.md b/demo/plan_pr2.md new file mode 100644 index 0000000..ef8d387 --- /dev/null +++ b/demo/plan_pr2.md @@ -0,0 +1,114 @@ +# Chat evals plan — Stage 2 (stabilization) + +**Branch:** `chat-evals-stabilization` +**Status:** Stage 1 merged in [**#86**](https://github.com/appspace/kwwhat/pull/86) (2026-05-07) — that is the baseline we build from. + +--- + +## Stage 2 — current focus + +### Goal + +Strengthen repeatable evals: golden dataset, rubrics, light automation, stable result format. No blocking on upstream nao. + +### Scope + +- **Golden dataset** — single-turn; ~10–12 entries; fields: `question_id`, `user_input`, `reference_answer`, `reference_contexts` (or `source_refs`), optional `human_explanation`. + - `reference_contexts` = evidence pointers (`RULES.md`, `semantic_models.yml`, `marts.yml`) for traceability and review — **not** extra chunks injected into model prompts. +- **G-Eval rubrics** — e.g. Terminology (no `session`; source: [`RULES.md`](demo/chat-bi/RULES.md)), Rate Format (%, pp), Metric Validity (only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml)), Completeness vs expected output — aim for ≥1 custom rubric. +- **`results_summary.py`** — lightweight script to generate `stage1_analysis.csv` and `stage1_delta_summary.md` from `results_*.json`; first automation step in Stage 2. +- **Standardize result fields** — `semantic_metric`, `semantic_score`, `semantic_threshold`, `semantic_pass`, `semantic_reason`, `judge_model`. +- **Upstream nao direction** — native generic/context evals in `nao test` via upstream issue/contribution; **parallel track**, not a Stage 2 blocker; local prep proceeds independently. 
+- **Optional CI** — add gate once rubrics and flake are under control. +- **Plan doc** — this file (`plan_pr2.md`) on branch `chat-evals-stabilization`. + +### Visualization in Stage 2 + +Keep lightweight: tables and short markdown recaps in PRs. No standalone dashboard deliverable yet. + +### Post–Stage 1 calibration (carry into Stage 2 runs) + +Tune with real runs and document outcomes in Stage 2 PRs: + +- Acceptable **cost** and **execution_time** per run. +- Numeric / ordering **tolerance** thresholds. +- **Priority** among error categories (`hallucination` vs `wrong_filter` vs `sql_logic`). +- Top **eval scenarios** to grow first (e.g. partner support flows when ready). + +--- + +## Stage 3 — direction (branch `chat-evals-analytics`, after Stage 2 merge) + +- **Streamlit viewer** — color-coded summary (questions × rubrics), drill-down per row (actual vs expected + reason per rubric), aggregate trends per rubric across runs. +- **Chat replays (optional source)** — use nao chat replay storage as a read-only source of candidate eval cases; sanitize/anonymize before dataset inclusion. +- **Long-term storage** of eval history. +- **Stricter org-wide gates.** +- **Plan doc** — `plan_pr3.md` on branch `chat-evals-analytics` (after Stage 2 merge). + +--- + +## Out of scope + +- **deepeval** full integration into demo runtime or CI as a **required** merge gate — unless explicitly approved. +- **Sidecar/post-processing G-eval scripts** as a required Stage 2 gate before rubrics are stable. +- **Standalone BI dashboard** (Power BI / Metabase on DuckDB) — duplicate effort; revisit only with clear value. +- **Tableau Public** — optional later if portfolio case warrants it. +- **MCP / external agent distribution** — separate parallel track; not part of this eval roadmap. + +--- + +## Stage 1 — completed reference + +> Full execution checklists and PR alignment live in [**#86**](https://github.com/appspace/kwwhat/pull/86). This section is a compact record to orient Stage 2. 
+ +### Goal (achieved) + +Test whether **curated project context** drives correct chat behavior — not the LLM product, not RAG — and catch regressions when [`agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`RULES.md`](demo/chat-bi/RULES.md), [`nao_config.yaml`](demo/chat-bi/nao_config.yaml), [`semantic_models.yml`](models/semantic/semantic_models.yml), or [`marts.yml`](models/marts/marts.yml) change. + +### Delivered in #86 + +- **SQL eval tests** in `demo/chat-bi/tests/*.yml` — `decommissioned_ports_check`, `lately_snapshot`, `network_reliability_uptime` added alongside existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml). +- **Analysis artifacts** — [`stage1_analysis.csv`](demo/chat-bi/tests/analysis/stage1_analysis.csv) and [`stage1_delta_summary.md`](demo/chat-bi/tests/analysis/stage1_delta_summary.md) filled for the expanded `nao test` run (2 passed, 2 failed — both failures: SQL binder error `Catalog "RAW" does not exist`, tracked as context/namespace drift signal). +- **Reproducible outputs** — Docker volume: host `demo/chat-bi/tests/outputs/` ↔ `/app/kwwhat/tests/outputs/`; `results_*.json` reachable without `docker cp` (see [`demo/README.md`](demo/README.md)). +- **Prompt wording** tightened to match SQL assertions. +- **Ignore rules** — local run noise excluded from commits. + +### Approach locked for Stage 1 (carries forward as baseline convention) + +- `nao test` + SQL = **factual** reference layer. +- Manual `semantic_label` / `failure_reason` = lightweight semantic layer; no automated G-eval judge in Stage 1. +- YAML cases called **"SQL tests"**; baseline shape = [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) structure. +- Storage: nao JSON pattern, Docker volume path above. +- **Upstream native evals** = parallel Stage 2 input, not a Stage 1 blocker. 
+ +### Decisions locked with #86 + +- `semantic_label` values: `correct` = accurate and complete; `partial` = correct conclusion but imprecise or incomplete; `incorrect` = factual error or hallucination. +- `status` and `semantic_label` are **independent layers** — harness pass/fail vs human rubric. +- Do not bulk-commit raw `results_*.json`; keep selected analysis artifacts only. + +### Mini analysis table columns (Stage 1; extend for Stage 2) + +`question`, `actual`, `expected` (optional), `status`, `semantic_label`, `failure_reason` (if fail), `cost`, `execution_time` (+ `run_id`, `model`, etc. already used in CSV). + +--- + +## Test design constraints (all stages) + +**Rules:** [`RULES.md`](demo/chat-bi/RULES.md), [`agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`nao_config.yaml`](demo/chat-bi/nao_config.yaml) — DuckDB only, no schema introspection, only `fact_*`/`dim_*` tables (`analytics.ANALYTICS.`), default time window last 7 days. + +**SQL:** `sql:` must use real columns from [`marts.yml`](models/marts/marts.yml) and only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml). + +**Time windows:** if expected SQL has no date filter and aggregates over all available data, the prompt must explicitly ask for "full history"; otherwise the default last-7-days rule applies. + +**Reproducibility:** use `dbt run --full-refresh` for demo test setup when previous incremental state could affect eval outputs. + +**Narrative:** `charge attempt`, `transaction`, `visit` — never `session`. Presentation rules (%, pp) apply to **answer quality** evaluation, not to the `sql:` block. + +**No extra context in test prompts** — no injected chunks; tests exercise curated nao context only (`agent_instructions.md`, `RULES.md`, etc.). + +--- + +## Scratch / course notes (ephemeral) + +_Bullets from evals course / syncs. 
Trim or delete before merging to `main`._ diff --git a/demo/thoughts.md b/demo/thoughts.md new file mode 100644 index 0000000..ac5d0e7 --- /dev/null +++ b/demo/thoughts.md @@ -0,0 +1,291 @@ +# Thoughts — golden dataset format + +## What we're building + +**Goal** — implement an evals framework that quantifies the impact of context changes on the nao Chat BI tool. We are not testing the LLM or general chat performance. We are testing one specific thing: did a change to the context — RULES.md, semantic model definitions, or similar input files — make the assistant's answers better or worse? The eval score is a signal for context quality, not model quality. Nao already has SQL tests in place that guard against schema linking failures and semantic gaps. We are looking to add non-deterministic evals that catch failures even when the SQL and the number are correct. + + +## Open questions + +- [ ] Referenceless (RAG triad) or reference-based (Correctness), or both? +- [ ] Use DeepEval's built-in metrics or maintain custom judge prompts? +- [ ] RAG triad requires nao to maintain a list of content-bearing tools (`execute_sql`, `read`, `grep`) — acceptable ongoing cost? If a new tool is added (e.g. `fetch_api`, `query_vector_store`), the team must decide whether its output is grounding content or navigation metadata and update the list. + +## Acceptance criteria + +- [ ] Data teams define test cases in `tests/evals/` alongside existing `tests/*.yml` SQL tests +- [ ] `nao evals` runs from a project directory and produces a JSON report in `tests/outputs/` +- [ ] Report includes pass/fail per test case, per-metric scores and reasons, and a summary +- [ ] Exit code is non-zero when any case fails — `nao evals` is CI-friendly +- [ ] Golden dataset lives in `tests/evals/golden_dataset.jsonl` — no boilerplate beyond `id`, `input`, `expected_output` + +--- + +## Design choices + +We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this project. 
Here is what each term means: + +**LLM-as-a-judge** — instead of checking outputs with deterministic rules or exact string matches, a second LLM (the "judge") reads the assistant's response and scores it against the expected answer. This handles the inherent non-determinism of Chat BI. + +**Single-turn** — each eval entry is a single, atomic unit of interaction with the LLM app: one user input, one assistant response, no conversation history. The assistant is evaluated on what it says in that one reply, in isolation. + +**End-to-end** — we evaluate the observable input and output of the Chat BI system and treat it as a black box. We do not instrument internal steps — no retrieval spans, no tool call traces, no sub-agent scoring. We care about the result the user sees, not the path the system took to produce it. This is the right fit for context-change evals: if the answer improved, the context change worked, regardless of what happened inside. + +**Local-first** — the eval harness runs entirely on the developer's machine, with no external eval platform or cloud service required. The golden dataset, judge prompts, scores, and results all live in this repository. This keeps the feedback loop fast, keeps data private, and means the eval is as easy to run as any other dbt command. + +**Same model as judge** — we plan to use the same model that powers Chat BI as the judge. This simplifies setup: no second API key, no model version management, no configuration drift. The trade-off is that models tend to score their own outputs more favorably — a known bias in self-evaluation. We accept this for now in exchange for simplicity, and can swap in a separate judge model later if scores prove unreliable. + +**Eval framework (deepeval or similar)** — rather than hand-rolling judge prompts, we intend to use an established eval framework. 
This means the nao labs team does not need to become prompt engineering experts just to run evaluations — the framework owns the judge prompt design and scoring logic. It also provides a foundation that can be extended to multi-turn evals later without rebuilding from scratch. + +--- + +### Metric strategy: referenceless vs reference-based + +Two approaches, not mutually exclusive. + +**Option A — Referenceless: RAG triad (Faithfulness, Contextual Relevancy, Answer Relevancy)** + +The RAG triad is a proxy framework. It assumes correctness flows downstream from three conditions: +(1) the context retrieved was relevant to the question, (2) the answer was grounded in that context, +and (3) the answer addressed what was asked. If all three hold, a correct answer is likely — but +it is never directly verified. The triad deliberately dodges the hard problem: does the actual output +match ground truth? The reason it does so is that ground truth is assumed to be expensive and +ambiguous to define. + +For nao this means the triad catches hallucinations and off-topic answers but will not catch a +response that is faithful, relevant, and still factually wrong — for example, an answer grounded in +context that itself contains a stale or incorrect value. 
+ +These are purpose-built metrics — DeepEval already knows which `LLMTestCase` fields each one needs: +- `FaithfulnessMetric` → `input`, `actual_output`, `retrieval_context` +- `ContextualRelevancyMetric` → `input`, `retrieval_context` +- `AnswerRelevancyMetric` → `input`, `actual_output` + +```python +from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric, AnswerRelevancyMetric +from deepeval.test_case import LLMTestCase + +test_case = LLMTestCase( + input="How many ports are currently decommissioned?", + actual_output="There are 4 decommissioned ports.", + retrieval_context=["[SQL result]\ndecommissioned_ports: 4"], +) + +# `judge` is the judge model (a model-name string or a DeepEvalBaseLLM instance), defined elsewhere +faithfulness_metric = FaithfulnessMetric(threshold=0.7, model=judge, include_reason=True) +context_relevancy_metric = ContextualRelevancyMetric(threshold=0.5, model=judge, include_reason=True) +answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model=judge, include_reason=True) +``` + +**Option B — Reference-based: Correctness (GEval)** + +Every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in whether the agent arrived at the expected answer rather than asking it to reason from the rubric alone. + +Compares the actual output directly against the expected output. Requires an `expected_output` for +every golden record. Catches factual errors the triad misses — if the number is wrong, the score +drops regardless of how grounded the answer is. + +`GEval` is a blank-slate metric — you write the evaluation steps yourself, so you must explicitly declare which `LLMTestCase` fields to pass via `evaluation_params`. 
+ +```python +from deepeval.metrics import GEval +from deepeval.test_case import LLMTestCase, LLMTestCaseParams + +test_case = LLMTestCase( + input="How many ports are currently decommissioned?", + actual_output="There are 4 decommissioned ports.", + expected_output="There are 4 decommissioned ports.", +) + +correctness_metric = GEval( + name="Correctness", + evaluation_steps=[ + "Compare the actual output directly with the expected output to verify factual accuracy.", + "Check if all elements mentioned in the expected output are present and correctly represented in the actual output.", + "Assess if there are any discrepancies in details, values, or information between the actual and expected outputs.", + ], + evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT], +) +``` + +The cost: `expected_output` must be maintained. For nao's SQL-answerable questions this is low +— the expected answer can be generated by running the SQL directly. For open-ended or exploratory +questions it is harder to define and more likely to drift as the data changes. + +**Recommended: both together** + +The triad and Correctness are complementary. The triad diagnoses *why* something failed (wrong +context, hallucination, off-topic). Correctness catches *whether* it failed. Running both gives a +richer signal: a response can pass Faithfulness but fail Correctness (grounded in context but the +context was wrong), or pass Correctness but fail Faithfulness (lucky correct answer not actually +supported by what the agent saw). + +The `expected_output` field is already in every golden record. Adding Correctness requires one +additional fixture in `conftest.py` and one line in the `metrics=[]` list — no schema changes. + +--- + +## Metrics borrowed from the RAG triad + +RAG retrieves context, which is different from what nao's users do: they curate it. We can account for this difference and reuse RAG metrics. 
+ +The RAG triad tests the relationships between three entities (Question, Context, and Response) with three metrics: 1. `Context Relevance`: is the retrieved context relevant to the question; 2. `Faithfulness`: is the answer grounded in the retrieved context; 3. `Answer Relevance`: is the answer relevant to what was asked. We can reuse this methodology by substituting curated context for retrieved context. So the question changes from "Was the right context retrieved?" to "Was the right context curated?" + +| Metric | Inputs | What it checks | +|--------|--------|-------| +| Context Relevance | input and curated_context | Was the context relevant to the question? | +| Faithfulness | input, actual_output, and curated_context | Is every claim grounded in the curated context — no hallucinations? | +| Answer Relevance | input and actual_output | Does the response address what the user actually asked? | +| Completeness | input and actual_output | Did the response cover the full scope of the question at the right level of detail? | + +**How curated context is captured at runtime** + +nao does not have a retrieval step — its context is the set of tool calls the agent made while composing its answer. Tool calls that return actual content (e.g. `execute_sql`, `read`, `grep`) become the curated context passed to the judge. Tool calls that return only file metadata (e.g. `list`, `search`) are excluded — they are navigation, not grounding. 
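
The content-versus-navigation split described above could be sketched roughly as follows. This is a minimal illustration, not nao's implementation: the tool names come from the text, while the `{"toolName": ..., "output": ...}` record shape, the `CONTENT_TOOLS` constant, and the `build_curated_context` helper are hypothetical names chosen for the example.

```python
import json

# Tool names are taken from the text above; the record shape is an assumption.
CONTENT_TOOLS = {"execute_sql", "read", "grep"}


def build_curated_context(tool_results):
    """Keep outputs of content-bearing tools and serialize each one to a string."""
    context = []
    for result in tool_results:
        if result.get("toolName") not in CONTENT_TOOLS:
            continue  # navigation tools such as `list` and `search` are dropped
        output = result.get("output")
        # Strings pass through unchanged; structured outputs become stable JSON.
        text = output if isinstance(output, str) else json.dumps(output, sort_keys=True)
        context.append(f"[{result['toolName']}]\n{text}")
    return context
```

The returned list of strings is the kind of value that would be handed to the judge as `retrieval_context`. Keeping the allow-list in one constant makes the "is this new tool grounding or navigation?" decision a one-line change.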
+ +--- + +## Framework options + +| | DeepEval | Latitude | In-house | +|--|----------|----------|----------| +| **Approach** | Pre-built generic metric library | Evals derived from production failures and expert judgment | Hand-rolled judge prompts | +| **Best for** | Pre-production unit testing — useful when there is no production traffic yet | Post-production — requires real failure data to work from | Full control, no dependencies | +| **Main value** | No prompt engineering required; plug in metrics and run | Evals grounded in actual user failures, not generic rubrics | Fully tailored to the use case | +| **Main risk** | Score quality depends on golden dataset quality — the framework is only as good as the reference answers | Not useful pre-production; concedes this itself | Requires prompt engineering expertise; hard to maintain | +| **Multi-turn path** | Built in | Built in | Rebuild from scratch | + +**Why this matters for nao:** we are targeting reference-based evals and assume end users arrive with a golden dataset. This changes the comparison significantly. The "generic metrics lie" concern — the main argument against DeepEval — is largely neutralised when every entry has a `reference_answer`: the judge is not scoring in the abstract, it is comparing against a concrete expected output. Latitude's value proposition (evals grounded in real production failures) does not apply here; a golden dataset replaces the need for production traffic. In-house also becomes more viable since comparing against a reference answer is a simpler judge prompt than scoring on abstract rubrics — but DeepEval still wins on setup cost and the multi-turn path for the nao labs team. + +--- + +## Do not use the rest +--- + +## Proposed entry format + +```yaml +- question_id: q009 + category: metric_validity + eval_type: LLM-as-judge + user_input: "What is the charge point availability index?" 
+ primary_context: models/semantic/semantic_models.yml + reference_answer: | + The metric "charge point availability index" is not defined in the semantic model. + The closest available metric is **uptime**. Would you like that instead? + reference_contexts: + - file: models/semantic/semantic_models.yml + hint: only metrics defined here are valid + - file: demo/chat-bi/RULES.md + hint: do not make up metrics + human_explanation: Model must decline and redirect — no hallucinated metrics. +``` + +## Field notes + +| Field | Purpose | +|-------|---------| +| `category` | groups entries by rubric (`metric_validity`, `terminology`, `rate_format`, `completeness`) | +| `eval_type` | `LLM-as-judge` or `sql` — determines how the harness evaluates the answer | +| `primary_context` | single file path for reviewer traceability — not injected into prompts | +| `reference_contexts` | supporting evidence pointers for reviewer traceability; not injected into prompts | +| `reference_answer` | what a correct answer looks like — format and intent, not exact match | +| `human_explanation` | one line; captures the rubric intent for the judge prompt | + +## Categories to consider + +| Category | Criteria | Why it matters | Approach | +|----------|----------|----------------|----------| +| `sql_test` | Exact match against SQL assertion | catches regressions in factual outputs immediately | rule | +| `metric_validity` | Did the model avoid inventing metrics not defined in the context? | prevents made-up KPIs from reaching dashboards and decisions | LLM-as-judge | +| `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? | prevents hallucinations, ensures traceable answers, builds user trust | LLM-as-judge | +| `answer_relevance` | Does the answer address the user input? | ensures the response is useful, not just grounded in context | LLM-as-judge | +| `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? 
| keeps domain language consistent across the product and reports | LLM-as-judge | +| `completeness` | Does the response cover everything the user input asked for? Is it complete? Is it at the right level of detail? | avoids partial answers that require follow-up or cause misinterpretation | LLM-as-judge | + +## Open question + +Should `primary_context` point to the file that lets a reviewer **verify** the answer +(e.g. `semantic_models.yml` to confirm the metric doesn't exist), or the file that +states the **rule** being tested (e.g. `RULES.md` — "do not make up metrics")? + +--- + +## Eval prompt + +Two parts: a **system prompt** (static, judge persona) and a **user prompt** (rendered per entry). + +### System prompt + +``` +You are an evaluator for a BI chat assistant that answers questions about EV charging networks. +Your job is to score the assistant's response against the reference answer and criteria provided. +Be strict and concise. Return only valid JSON. +``` + +### User prompt template + +``` +## User input +{{ user_input }} + +## Reference answer +{{ reference_answer }} + +## Evaluation criteria +Category: {{ category }} +Criteria: {{ criteria }} +{{ human_explanation }} + +## Actual response +{{ actual_response }} + +**ANALYZE THE ACTUAL RESPONSE FOR THIS USER INPUT** + +Assess: + +1. **usefulness** ("high" / "medium" / "low"): How closely does the actual response align with the reference answer? + - "high": Matches the reference answer in intent, format, and key facts + - "medium": Partially aligned — correct intent but missing format or key details + - "low": Misaligned — wrong intent, wrong facts, or contradicts the reference answer + +2. **signal_pct** (0–100): What percentage of this actual response is RELEVANT to the reference answer? 
+ - 80–100: Highly focused, almost all content maps to the reference answer + - 50–79: Majority relevant, some content not in the reference answer + - 20–49: Less than half maps to the reference answer + - 0–19: Almost no overlap with the reference answer + +3. **note**: Brief explanation (1 sentence) + +**KEY TRADEOFFS TO CONSIDER:** +- A short reply can have HIGHER signal than a long one — brevity is not a flaw if the answer is complete + +**BE STRICT:** +- Do not give "high" if the response is missing key facts or format requirements from the reference answer +- Do not round up signal_pct — penalize filler, hedging, or content not grounded in the reference answer +- A polite but wrong answer is still "low" + +**OUTPUT FORMAT (JSON):** +{ + "question_id": "{{ question_id }}", + "category": "{{ category }}", + "summary": { + "usefulness": "high" | "medium" | "low", + "signal_pct": 0-100, + "note": "" + } +} +``` + +### Notes + +- `primary_context` is a reviewer traceability pointer — not injected into the judge prompt. +- The judge model goes into a `judge_model` field on the result record (not in the prompt itself). + +## Execution strategy + +**Prompt chaining** — break eval into stages where each stage feeds the next: + +1. **Retrieve** — fetch the actual response for a `question_id` from the chat log +2. **Judge** — render the user prompt template and call the judge model; get back the JSON score +3. **Aggregate** — collect scores across entries, compute pass rates per category + +Each stage is a discrete LLM call (or SQL step). The output of one becomes the input of the next. This keeps each prompt focused and makes failures easy to isolate. diff --git a/plan.MD b/plan.MD new file mode 100644 index 0000000..9d3bc08 --- /dev/null +++ b/plan.MD @@ -0,0 +1,401 @@ +# Plan: LLM-as-Judge RAG Triad Evals for nao + +## Background + +nao's agent answers questions by reading file-based project context (schema docs, `columns.md`, +`preview.md`) and executing SQL. 
When a user asks a question, the agent: + +1. Reads relevant files from the project folder (`read`, `grep`, `search`, `list` tools) +2. Executes SQL queries (`execute_sql` tool) +3. Composes a natural language response from those tool outputs + +The **curated context** for evals is the subset of tool outputs that carry actual +content: `execute_sql`, `read`, `grep`. Both `list` and `search` are excluded — they return only +file metadata (path/dir/size), not file content. They are navigation tools the agent uses to +decide what to read next, not to compose its answer. + +--- + +## Two Approaches to Curated Context + +### Option A — Static: bake context into the golden dataset + +Pre-record the tool outputs when building the dataset. Each golden record carries: + +```jsonl +{ + "id": "q001", + "input": "What were total sales by region last quarter?", + "expected_output": "...", + "curated_context": [ + "[File: databases/.../columns.md]\n# users table...", + "[SQL result]\nColumns: region, total\n| North | 1200000 |..." + ] +} +``` + +The eval test reads context directly from the file — no running backend needed. + +**Downside:** Context goes stale the moment the project folder, schema docs, or SQL results change. +If nao's prompt is updated and the agent now reads different files or generates different SQL, the +golden context no longer reflects what the model actually used. You end up judging faithfulness +against a context that wasn't actually provided to the LLM — which defeats the metric entirely. +Every dataset update becomes a manual two-step: run the agent, copy the tool outputs, paste them +into the JSONL. This does not scale past a handful of records. + +--- + +### Option B — Dynamic: capture context at eval runtime (chosen approach) + +At eval time, call a dedicated nao endpoint that runs the agent and returns **both** the final +answer and the exact tool results that produced it, in a single response. 
The eval harness uses +those tool results directly as `retrieval_context` for DeepEval. No context is stored in the +dataset. + +**Why this is correct:** + +- The context passed to DeepEval is always the context that was actually provided to the LLM for + *that specific run*. There is no staleness. +- If nao's prompt changes, the agent reads different files, or the database is updated, the eval + automatically reflects the new reality — it measures faithfulness of what nao actually said + against what nao actually saw. +- The golden dataset stays lean: only `input` and `expected_output`. No manual context curation. +- It also enables a future regression signal: if faithfulness drops after a prompt or tool change, + you know the model started making claims not grounded in its own context. + +The only requirement is a running nao backend, which is already needed to get `actual_output` +anyway. Option B adds zero extra infrastructure. + +**Maintenance burden:** `CONTEXT_TOOLS` in `apps/backend/src/routes/evals.ts` must be kept in sync +as new agent tools are added. When a new tool is introduced, the nao team needs to decide whether +its output is content-bearing (add it) or navigation-only (leave it out). This is a small but +ongoing maintenance cost that Option A does not have. + +--- + +## Architecture + +### What exists today + +`AgentManager` in `apps/backend/src/services/agent.ts` has two execution modes: + +- `stream()` — used by the main `/api/agent` route; returns a UI message stream to the browser +- `generate()` — non-streaming; returns `AgentRunResult` which already contains: + ```typescript + steps: ReadonlyArray<{ + toolCalls: ReadonlyArray<{ toolName: string; toolCallId: string; input: unknown }>; + toolResults: ReadonlyArray<{ toolCallId: string; output?: unknown }>; + }> + ``` + +`steps[*].toolResults` is the exact grounding context. It is already collected internally — it just +has no HTTP endpoint that exposes it. 
+ +--- + +## What Changes in the Backend (minimal) + +Add one new Fastify route: `POST /api/evals/chat`. + +It mirrors `/api/agent` but: + +1. Calls `testAgentService.runTest()` (same service used by `nao test`) — non-streaming, returns + the full `AgentRunResult` +2. Returns JSON `{ text, model_id, tool_results }` instead of a UI message stream +3. `tool_results` = `TestAgentService.extractToolCalls(result)` filtered to content-bearing tools: + `execute_sql`, `read`, `grep` + — `list` and `search` are intentionally excluded: both return only file metadata + (path/dir/size), not file content. They are navigation tools — the agent uses them to decide + what to read next, not to compose its answer. Including them dilutes Contextual Relevancy + with path listings that have no semantic relationship to the question. +4. Accepts an optional `model: { provider, modelId }` body field to override the project default +5. Protected by the same `authMiddleware` as every other route — no additional gate needed + +No changes to existing routes or agent internals. `generate()` is already exercised by automations. + +**Response shape:** +```json +{ + "text": "Total sales by region last quarter were: North $1.2M...", + "model_id": "claude-sonnet-4-6", + "tool_results": [ + { "toolName": "read", "output": { "content": "# sales table\n..." } }, + { "toolName": "execute_sql", "output": { "columns": ["region", "total"], "data": [...] } } + ] +} +``` + +`model_id` reflects whichever model ran — either the project default from Settings → Project → +Models, or the override passed in the request body. + +--- + +## Directory Layout + +Evals follow the same project-relative convention as `nao test`. The framework lives in the +`nao_core` package; the user's golden dataset lives in their project folder alongside other test +assets. 
+
+```
+# nao_core package (framework, shipped with nao)
+cli/nao_core/evals/
+  conftest.py              # ALL shared fixtures
+  test_evals_triad.py      # all metrics in one file
+
+# User's project (data, owned by the team)
+{project}/tests/evals/
+  golden_dataset.jsonl     # {id, input, expected_output}; no context field
+```
+
+This mirrors how `nao test` works: `nao_core` contains the runner logic; the user's `tests/` folder
+contains the test cases. Running `nao evals` from the project directory discovers
+`tests/evals/golden_dataset.jsonl` via `Path.cwd()`, exactly as `nao test` discovers
+`tests/*.yml`.
+
+**One file, one test function, all metrics together.** Every metric is measured against the same
+`LLMTestCase`, so the `/api/evals/chat` call happens once per record regardless of how many
+metrics are active. (DeepEval's `assert_test(test_case, metrics=[m1, m2, ...])` offers the same
+grouping; the harness calls `metric.measure()` directly instead, as described in the
+Implementation Sequence.) Separate files per metric would repeat the agent call once per metric
+per record: 4× the cost once all four metrics are in place.
+
+The file is named `test_evals_triad.py` after the established RAG triad (faithfulness, context
+relevance, answer relevance), with completeness as a fourth extension. The context here is not
+retrieved but captured from the agent's own tool outputs at runtime, so the triad is repurposed
+for curated context rather than retrieval.
+
+Adding a metric later = one new fixture in `conftest.py` + one new entry in the `metrics=[]` list.
+Nothing else changes.
+
+---
+
+## Golden Dataset Schema
+
+```jsonl
+{"id": "q001", "input": "What were total sales by region last quarter?", "expected_output": "..."}
+{"id": "q002", "input": "Which customers churned in May?", "expected_output": "..."}
+```
+
+No `curated_context` field. Context is captured live.
+
+---
+
+## Eval Harness Flow (per test case)
+
+```
+for each record in golden_dataset.jsonl:
+  1. POST /api/evals/chat { input: record.input, model?: { provider, modelId } }
+  2. → { text, model_id, tool_results: [...] }
+  3. curated_context = [serialize(tr) for tr in tool_results]
+  4. LLMTestCase(
+       input = record.input,
+       actual_output = response.text,
+       retrieval_context = curated_context,
+       expected_output = record.expected_output   # used by completeness later
+     )
+  5. for each metric: metric.measure(test_case)   # direct call: owns the instance, reads score/reason
+     record result in session_results
+     raise AssertionError if any metric failed
+```
+
+All three metrics are measured on the same test case, one `measure()` call each. Each metric
+keeps its own score and reason, so a failure in one does not suppress the others.
+
+**Tool result serialization** (pure Python in `conftest.py`, not backend logic):
+
+| Tool | Returns | Included | Reason |
+|---|---|---|---|
+| `execute_sql` | data rows | yes | primary grounding: the actual query results the agent used |
+| `read` | full file content | yes | schema docs, column descriptions, semantic models |
+| `grep` | matching lines + context | yes | actual file content snippets |
+| `search` | path/dir/size | **no** | glob file finder: navigation only, no content |
+| `list` | path/dir/size | **no** | directory listing: navigation only, no content |
+
+`search` and `list` are excluded for the same reason: both return only file metadata. They are
+navigation tools the agent uses to decide what to read next, not to compose its answer. Including
+them adds path listings with no semantic relationship to the question, diluting Contextual
+Relevancy without adding grounding signal.
+
+---
+
+## DeepEval Metric Configuration
+
+The judge model is read from the `/api/evals/chat` response `model_id` field, i.e. whichever
+model ran the agent.
`conftest.py` instantiates the judge from that model ID:
+
+- **Claude models** (`model_id.startswith("claude")`): wrapped in a custom `_ClaudeJudge`
+  subclass of `DeepEvalBaseLLM` using the `anthropic` SDK directly
+- **OpenAI and other providers**: passed as a plain string; DeepEval handles them natively
+
+```python
+from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric, FaithfulnessMetric
+
+# `judge` is the value returned by the `judge_model` fixture
+FaithfulnessMetric(threshold=0.7, model=judge, include_reason=True)
+ContextualRelevancyMetric(threshold=0.5, model=judge, include_reason=True)
+AnswerRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
+```
+
+**Contextual Relevancy threshold is intentionally lower than the other two metrics.** In a classic
+RAG pipeline, retrieval is scoped tightly to the question: only the most relevant chunks are
+fetched. A chat agent works differently: it reads schema docs, semantic model definitions, and
+column descriptions to build a broad understanding of the data model before it can compose a
+correct answer. Much of that context is genuinely necessary for the agent to reason well, but it is
+not directly about the specific question. The result is that Contextual Relevancy is structurally
+diluted compared to a purpose-built retrieval system, by design and not by failure.
+
+This means a score of 0.5–0.65 on Contextual Relevancy may reflect a well-functioning agent
+rather than a broken retrieval step. The practical thresholds to start with:
+
+```python
+FaithfulnessMetric(threshold=0.7, ...)         # grounding: strict
+ContextualRelevancyMetric(threshold=0.5, ...)  # broad context: looser
+AnswerRelevancyMetric(threshold=0.7, ...)      # answer quality: strict
+```
+
+Calibrate from observed baselines as the dataset grows, not from RAG benchmarks.
+
+Using the same model as judge introduces self-serving bias: a model tends to score its own outputs
+higher. Passing `-m` does not remove that bias. The agent runs with the model given by `-m`, and
+the judge is built from the response `model_id`, so judge and agent always share a model. As
+specified here, a truly independent judge would need a separate judge-model setting, which this
+plan does not define; until then, read absolute scores with that bias in mind.
+
+---
+
+## `conftest.py` Fixtures
+
+- **`nao_client`**: `httpx.Client` pointed at `http://localhost:5005` (or `NAO_EVAL_URL` env
+  var). Session-scoped. Cookie set from `NAO_AUTH_COOKIE` env var (injected by `nao evals` runner).
+- **`pytest_generate_tests`** (pytest hook, not a fixture): parametrizes over records in
+  `golden_dataset.jsonl`. Dataset path comes from `NAO_EVALS_DIR` env var (set by runner to
+  `{cwd}/tests/evals`).
+- **`eval_response`**: calls `POST /api/evals/chat` for each record. Passes `model` body field if
+  `NAO_EVAL_MODEL` is set.
+- **`judge_model`**: builds a `_ClaudeJudge` instance for Claude models; returns plain string for
+  OpenAI (DeepEval native).
+- **`llm_test_case`**: serializes tool results into `retrieval_context`, returns `LLMTestCase`.
+- **`faithfulness_metric`**, **`context_relevance_metric`**, **`answer_relevance_metric`**: metric
+  instances using the judge model.
+- **`session_results`**: session-scoped list collecting `{id, input, actual_output, passed, metrics}`
+  for each record. Populated in `test_evals_triad.py` after each `metric.measure()` call.
+- **`pytest_sessionfinish`** (pytest hook, not a fixture): writes `session_results` plus a
+  `summary: { total, passed, failed }` to `NAO_EVALS_OUTPUT_FILE` (set by the runner to
+  `{project}/tests/outputs/evals_{timestamp}.json`) at the end of the session.
+- **`completeness_metric`**: add later; `GEval` with custom rubric (see Adding Completeness Later).
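The tool-result serialization behind `llm_test_case` could look roughly like the sketch below. It assumes each entry in `tool_results` is a dict with `toolName` and `output` keys, as in the response shape shown earlier. The helper names `serialize_tool_result` and `build_retrieval_context`, the JSON rendering, and the defensive re-filter on tool names are illustrative assumptions, not the actual `conftest.py` code.

```python
import json
from typing import Any

# Content-bearing tools per the serialization table; `search` and `list` are navigation only.
CONTENT_TOOLS = frozenset({"execute_sql", "read", "grep"})


def serialize_tool_result(tool_result: dict[str, Any]) -> str:
    """Render one tool result as a single string for `retrieval_context`."""
    output = tool_result.get("output")
    # Strings pass through untouched; structured output is dumped as stable JSON.
    body = output if isinstance(output, str) else json.dumps(output, sort_keys=True, default=str)
    return f"[{tool_result['toolName']}] {body}"


def build_retrieval_context(tool_results: list[dict[str, Any]]) -> list[str]:
    """Keep content-bearing tools only, preserving the order the agent called them."""
    return [
        serialize_tool_result(tr)
        for tr in tool_results
        if tr.get("toolName") in CONTENT_TOOLS
    ]


# Hypothetical sample: one navigation call followed by two content-bearing calls.
sample = [
    {"toolName": "list", "output": {"path": "models/"}},
    {"toolName": "read", "output": {"content": "# sales table"}},
    {"toolName": "execute_sql", "output": {"columns": ["region"], "data": [["North"]]}},
]
context = build_retrieval_context(sample)
```

Prefixing each string with the tool name is one way to let the judge tell schema text from query results; whether that helps or hurts Contextual Relevancy is something to check against real runs.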
+ +--- + +## Running the Evals + +```bash +# Start nao backend normally +npm run dev -w @nao/backend + +# Run all evals (from your project directory) +nao evals + +# Specify the agent model (mirrors nao test -m) +nao evals -m anthropic:claude-sonnet-4-6 + +# Verbose output +nao evals -v + +# Run a single record by ID +nao evals -s q001 + +# Explicit credentials +nao evals -u user@example.com --password secret +``` + +Install eval deps once with `uv sync --extra evals` from `cli/`. The `evals` extra adds only +`deepeval` — provider SDKs come from the user's existing nao installation. + +Add `make evals` to the root `Makefile` for consistency with `make lint`. + +Results are written to `{project}/tests/outputs/evals_{timestamp}.json`: + +```json +{ + "timestamp": "2026-06-25T21:35:40.653031", + "results": [ + { + "id": "q001", + "input": "How many ports are currently decommissioned?", + "actual_output": "There are currently 4 decommissioned ports (i.e., ports with a non-null decommissioned_ts).", + "passed": true, + "metrics": [ + { "name": "Faithfulness", "score": 1.0, "threshold": 0.7, "passed": true, "reason": "The score is 1.00 because the actual output is perfectly faithful to the retrieval context with no contradictions found!" }, + { "name": "Contextual Relevancy", "score": 1.0, "threshold": 0.5, "passed": true, "reason": "The score is 1.00 because the retrieval context is perfectly relevant, directly addressing the question with a specific SQL query counting decommissioned ports and providing the exact answer: 'There are 4 decommissioned ports.'" }, + { "name": "Answer Relevancy", "score": 1.0, "threshold": 0.7, "passed": true, "reason": "The score is 1.00 because the response directly and completely addresses the question. Great job staying on topic!" } + ] + }, + { + "id": "q002", + "input": "What is the overall uptime percentage of my EV charging network for the full history?", + "actual_output": "Overall Uptime: 99.71% across Sep 15, 2025 – Jun 26, 2026. 
All 4,986 downtime minutes were concentrated in October 2025. Every other month recorded 100% uptime.", + "passed": true, + "metrics": [ + { "name": "Faithfulness", "score": 0.8889, "threshold": 0.7, "passed": true, "reason": "The score is 0.89 because the actual output incorrectly states total commissioned minutes as 1,722,240 when the retrieval context specifies 1,716,480." }, + { "name": "Contextual Relevancy", "score": 0.507, "threshold": 0.5, "passed": true, "reason": "The score is 0.51 because while a significant portion of the retrieval context is irrelevant (empty SQL results, database metadata, schema config), the context contains the direct answer '99.71% uptime' and supporting breakdowns. The large volume of irrelevant context dilutes the score." }, + { "name": "Answer Relevancy", "score": 0.8182, "threshold": 0.7, "passed": true, "reason": "The score is 0.82 because the output addresses overall uptime but includes unnecessary detail about which chargers caused the downtime events — not directly relevant to the question asked." } + ] + } + ], + "summary": { + "total": 2, + "passed": 2, + "failed": 0 + } +} +``` + +Both records pass. q001 is a clean 1.0 across all metrics — the agent found the exact SQL and answer. +q002 passes but reveals two signals worth tracking: Faithfulness at 0.89 (the agent misreported +one number) and Contextual Relevancy just above threshold at 0.51 (broad schema context still +diluting the score, consistent with the structural dilution described above). + +--- + +## CI + +Do not block PRs on evals — they are slow (~5–10 s per LLM judge call) and cost money. + +--- + +## Implementation Sequence + +1. **Add `POST /api/evals/chat` Fastify route** — ~40 lines in `apps/backend/src/routes/evals.ts`; + calls existing `agent.generate()`, returns `{ text, model_id, tool_results }`, accepts optional + `model` override +2. **Register route** in `apps/backend/src/app.ts` under `/api/evals` +3. 
**Write `cli/nao_core/evals/conftest.py`** — fixtures, judge model factory, tool result + serializer, dataset loader via `NAO_EVALS_DIR` +4. **Write `cli/nao_core/evals/test_evals_triad.py`** — single parametrized test; calls + `metric.measure(test_case)` per metric directly (not `assert_test`) so scores and reasons are + readable on the same instances; appends to `session_results`; raises `AssertionError` on failure + so that `nao evals` exits non-zero and pytest reports `FAILED` when thresholds aren't met — + consistent with `nao test`, which calls `sys.exit(1)` on any test failure. The JSON output + is always written regardless of pass/fail. +5. **Write `cli/nao_core/commands/evals/runner.py`** — `nao evals` command mirroring `nao test`; + `-m`, `-u`, `--password`, `-v`, `-s` flags; authenticates via `get_auth_session`, sets env vars, + invokes pytest; after pytest exits, reads `summary` from the JSON output and calls `sys.exit(1)` + if `failed > 0` — same pattern as `nao test` +6. **Register command** in `cli/nao_core/commands/__init__.py` and `cli/nao_core/main.py` +7. **Add `evals` extra** to `cli/pyproject.toml` — `deepeval>=2.0` only +8. **Populate `{project}/tests/evals/golden_dataset.jsonl`** — Q&A pairs from golden dataset + +--- + +## Adding Completeness Later + +The full triad ships from day one. Completeness is the only metric deferred — it has no +first-class DeepEval class yet. When ready, adding it is two steps: + +1. Add a `completeness_metric` fixture in `conftest.py` using `GEval` with a rubric such as + "does the answer fully address all parts of the question, given the expected output?" +2. Add it to the `metrics=[]` list in `test_evals_triad.py` + +No new files, no API changes, no dataset changes — `expected_output` is already in every +`LLMTestCase` from day one. 
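A possible shape for the `completeness_metric` fixture is sketched below. `GEval` and `LLMTestCaseParams` are DeepEval's own classes; the rubric wording, the 0.7 threshold, and the helper name `build_completeness_metric` are placeholders to calibrate, not settled values. The import is deferred into the function because `deepeval` is only installed with the `evals` extra.

```python
from typing import Any

# Placeholder rubric and threshold: both are assumptions to tune against real runs.
COMPLETENESS_CRITERIA = (
    "Determine whether the actual output fully addresses every part of the input "
    "question, using the expected output as the reference for what a complete "
    "answer contains."
)
COMPLETENESS_THRESHOLD = 0.7


def build_completeness_metric(judge: Any) -> Any:
    """Return a GEval metric that scores completeness against `expected_output`."""
    # Deferred import: deepeval ships only with the optional `evals` extra.
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCaseParams

    return GEval(
        name="Completeness",
        criteria=COMPLETENESS_CRITERIA,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=COMPLETENESS_THRESHOLD,
        model=judge,
    )
```

In `conftest.py` this would sit behind a fixture that receives `judge_model` and returns `build_completeness_metric(judge_model)`.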
+ +| Metric | DeepEval class | Ships | Fields used | +|---|---|---|---| +| Faithfulness | `FaithfulnessMetric` | Day 1 | `input`, `actual_output`, `retrieval_context` | +| Context relevance | `ContextualRelevancyMetric` | Day 1 | `input`, `retrieval_context` | +| Answer relevance | `AnswerRelevancyMetric` | Day 1 | `input`, `actual_output` | +| Completeness | `GEval` (custom rubric) | Later | `input`, `actual_output`, `expected_output` |
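
The runner's final step from the Implementation Sequence (read `summary` from the JSON output, exit non-zero when `failed > 0`) might be sketched as follows. The function name `exit_code_from_results` and the choice to treat a missing or unreadable file as a failure are assumptions; only the `summary: { total, passed, failed }` shape comes from the output format described earlier.

```python
import json
from pathlib import Path


def exit_code_from_results(output_file: Path) -> int:
    """Map an evals JSON output file to a process exit code (0 = all passed)."""
    try:
        payload = json.loads(output_file.read_text(encoding="utf-8"))
        failed = int(payload["summary"]["failed"])
    except (OSError, ValueError, KeyError, TypeError):
        # Assumption: no readable summary means the run itself broke, so fail loudly.
        return 1
    return 1 if failed > 0 else 0
```

After pytest returns, the runner would pass the path from `NAO_EVALS_OUTPUT_FILE` to this helper and hand the result to `sys.exit()`.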