From b839e7f962d74addcfb98a66382c258b8f2162fc Mon Sep 17 00:00:00 2001
From: Anna Bekharskaia <anna.v.bekharskaya@gmail.com>
Date: Wed, 13 May 2026 19:32:55 +0300
Subject: [PATCH 01/38] docs(demo): baseline for Stage 2

---
 demo/plan_pr2.md | 167 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 demo/plan_pr2.md

diff --git a/demo/plan_pr2.md b/demo/plan_pr2.md
new file mode 100644
index 0000000..671acb1
--- /dev/null
+++ b/demo/plan_pr2.md
@@ -0,0 +1,167 @@
+# Stage 1 Plan (MVP Baseline)
+
+## Status — where we stand
+
+**Stage 1 merged in [`#86`](https://github.com/appspace/kwwhat/pull/86)** (2026-05-07). That PR is the **baseline we build from** for Stage 2+ on branch `chat-evals-stabilization`.
+
+**Delivered in #86**
+
+- **SQL eval tests** in `demo/chat-bi/tests/*.yml` — added cases alongside existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) (`decommissioned_ports_check`, `lately_snapshot`, `network_reliability_uptime`).
+- **Stage 1 analysis artifacts** — updated [`stage1_analysis.csv`](demo/chat-bi/tests/analysis/stage1_analysis.csv) and [`stage1_delta_summary.md`](demo/chat-bi/tests/analysis/stage1_delta_summary.md) for the expanded `nao test` run.
+- **Reproducible outputs** — Docker volume maps host `demo/chat-bi/tests/outputs/` ↔ container `/app/kwwhat/tests/outputs/` so `results_*.json` is reachable without `docker cp` (see [`demo/README.md`](demo/README.md)).
+- **Prompt wording** tightened on new tests so questions match what SQL assertions exercise.
+- **Local run noise** — ignore rules for generated artifacts where agreed (no accidental bulk commits of raw JSON).
+- **Approach locked for Stage 1** — `nao test` + SQL as **factual** reference; **manual** `semantic_label` / `failure_reason` in the mini analysis table for a lightweight semantic layer; **no** G-eval automation in that PR.
+
+**This document** — keeps the Stage 1 spec and merge alignment as **reference**, plus the **Stage 2 / Stage 3 placeholder** sections below. Edits here target the stabilization branch unless noted otherwise.
+
+---
+
+**Goal:** Test whether the curated project context still drives correct chat behavior — not the LLM itself, not the chat product, not RAG. Catch regressions when [`agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`RULES.md`](demo/chat-bi/RULES.md), [`nao_config.yaml`](demo/chat-bi/nao_config.yaml), [`semantic_models.yml`](models/semantic/semantic_models.yml), or [`marts.yml`](models/marts/marts.yml) change.
+
+## Scope
+
+- Add 2–3 SQL test cases in `demo/chat-bi/tests/*.yml`
+- Run expanded `nao test`
+- Save outputs and build mini analysis table (`demo/chat-bi/tests/analysis/stage1_analysis.csv`)
+- Ensure outputs are saved via Docker volume (no manual copy)
+- Perform baseline vs expanded comparison (baseline = existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) run)
+- Send short update with findings
+
+## Done Criteria
+
+- `>=3` test files total (including existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml))
+- 1 expanded run completed
+- `results_*.json` retained following nao's local JSON output pattern (reachable via Docker volume)
+- Docker volume confirmed working (outputs reachable on host without `docker cp`)
+- mini analysis table (`demo/chat-bi/tests/analysis/stage1_analysis.csv`) filled for all tests in run
+- mini analysis table includes `failure_reason` for failed cases
+- Stage 1 test case format agreed (baseline: existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) structure)
+- baseline metrics recorded (accuracy / cost / time)
+- delta metrics captured (baseline vs expanded)
+- short status note in the PR (or agreed team channel)
+- storage decision documented for this run
+
+## Mini Analysis Table (Stage 1)
+
+Columns: `question`, `actual`, `expected` (optional), `status` (nao: `pass`/`fail`), `semantic_label` (`correct`/`partial`/`incorrect`), `failure_reason` (required if fail), `cost`, `execution_time`.
+
+**Note:** `status` and `semantic_label` are independent layers — `status` is factual (nao harness), `semantic_label` is human review (rubric: see **Alignment and remaining checks (merge)**).
+
+## Methodology Note
+
+- **`nao test`** — current **runtime** and **factual** baseline on the demo stack (DuckDB + chat-bi).
+- **Factual layer** — `nao test` + SQL reference (data correctness).
+- **Semantic layer (Stage 1)** — lightweight manual label (`semantic_label`) and `failure_reason`; no automated judge in this PR.
+- **Future direction** — native rubric/context eval support in `nao test` via upstream nao issue/contribution; not a Stage 1 blocker.
+
+## Stage 1 Boundary and Next Stages
+
+### Out of scope for Stage 1
+
+- Full integration of the **deepeval** package into the demo runtime or CI (unless explicitly approved later).
+- Making Answer Relevancy (or any deepeval metric) a required automated merge gate in Stage 1.
+- Sidecar/post-processing G-eval automation scripts for Stage 1.
+- Any automation that belongs to Stage 2+.
+
+### Stage 2+ direction (placeholder)
+
+- **Stage 2 (example):**
+
+  - **Upstream nao direction** — native generic/context evals in `nao test` via upstream issue/contribution; this is a parallel track, not a Stage 1 blocker.
+  - **Golden dataset** — single-turn; 10–12 entries; `question_id`, `user_input`, `reference_answer`, `reference_contexts` (or `source_refs`), optional `human_explanation`.
+  - **`reference_contexts` meaning** — evidence/source pointers (e.g. `RULES.md`, `semantic_models.yml`, `marts.yml`) used for traceability and review; **not** extra context injected into model prompts.
+  - **G-Eval rubrics** — Terminology (no `session`; source: [`RULES.md`](demo/chat-bi/RULES.md)), Rate Format (%, pp), Metric Validity (only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml)), Completeness vs expected output — aim for at least one custom metric.
+  - **Automation path** — prefer native `nao test` support over temporary post-processing scripts; local Stage 2 prep can proceed while upstream direction is clarified.
+  - **`results_summary.py`** — lightweight script to generate `stage1_analysis.csv` and `stage1_delta_summary.md` from `results_*.json`; proposed for early Stage 2 as first automation.
+  - **Standardize fields** — `semantic_metric`, `semantic_score`, `semantic_threshold`, `semantic_pass`, `semantic_reason`, `judge_model`.
+  - **Optional CI.**
+  - **Plan doc** — `plan_pr2.md` on branch `chat-evals-stabilization` (after Stage 1 merge).
+
+- **Stage 3 (example):**
+
+  - **Streamlit viewer** — color-coded summary (questions × rubrics), drill-down per row (actual vs expected + reason per rubric), aggregate trends per rubric across runs.
+  - **Chat replays (optional source)** — use nao chat replay storage as a read-only source of candidate eval cases; sanitize/anonymize before dataset inclusion.
+  - **Long-term storage** of eval history.
+  - **Stricter org-wide gates.**
+  - **Plan doc** — `plan_pr3.md` on branch `chat-evals-analytics` (after Stage 2 merge).
+
+### Visualization notes (non-blocking)
+
+Visualization is available as a future presentation layer, but it is **not required for this PR**.
+
+- **Stage 1:** no separate dashboard deliverable; use run artifacts + mini analysis table (`demo/chat-bi/tests/analysis/stage1_analysis.csv`) in PR updates.
+- **Stage 2:** keep visualization lightweight if useful (tables/recaps) while the eval mechanics and result format stabilize.
+- **Stage 3:** primary visualization layer can be **Streamlit** for eval results (summary, drill-down, reasons, trends).
+- **Out of scope for now:** standalone BI dashboard track (Power BI/Metabase on DuckDB) to avoid duplicate effort.
+- **Optional later:** revisit Tableau Public only if there is clear value in extending existing WIP.
+
+## Test Design Constraints
+
+**Rules:** [`demo/chat-bi/RULES.md`](demo/chat-bi/RULES.md), [`demo/chat-bi/agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`demo/chat-bi/nao_config.yaml`](demo/chat-bi/nao_config.yaml) — DuckDB only, no schema introspection, only `fact_*`/`dim_*` tables (`analytics.ANALYTICS.<table>`), default time window last 7 days.
+
+**SQL:** reference `sql:` must use real columns from [`models/marts/marts.yml`](models/marts/marts.yml) and only metrics defined in [`models/semantic/semantic_models.yml`](models/semantic/semantic_models.yml).
+
+**Narrative:** use `charge attempt`, `transaction`, `visit` — never `session`. Presentation rules (metrics at a glance, % format) apply to answer quality evaluation, not to the `sql:` block.
+
+**No extra context in tests:** do not inject additional context chunks into test prompts — tests rely only on the system context already configured in nao (`agent_instructions.md`, `RULES.md`, etc.). This is intentional: we test the curated context, not a retrieval layer.
+
+## PR (Stage 1 only)
+
+**PR title:**
+
+`Stage 1 MVP: initial eval set + expanded baseline run`
+
+**PR should include:**
+
+- new SQL tests in `demo/chat-bi/tests/*.yml`
+- expanded run outputs/artifacts (`results_*.json` from local nao output flow)
+- mini analysis table covering all tests from the expanded run
+- evaluation approach:
+  - SQL as factual reference layer
+  - LLM answer as semantic quality layer
+- link/reference to baseline run
+- delta summary: baseline vs expanded
+- short summary:
+  - pass/fail
+  - cost/time
+  - top 1–2 failure patterns
+  - next actions
+- storage outcome documented; artifact paths: container `/app/kwwhat/tests/outputs/`, repo/Docker volume `demo/chat-bi/tests/outputs/`
+- review note: `failure_reason` is filled for failed cases (`status=fail`)
+
+## Execution Order
+
+1. Add tests (`demo/chat-bi/tests/*.yml`)
+2. Run expanded `nao test`
+3. Save/retain `results_*.json` following nao's local JSON output pattern
+4. Fill mini analysis table (include `question`, `actual`, optional `expected`, `status`, `semantic_label`, `failure_reason`, `cost`, `execution_time`)
+5. Capture baseline vs expanded delta (accuracy / cost / time) and include a short summary of top 1–2 failure patterns
+6. Commit implementation work using one PR/branch approach consistently for this repo
+7. Post a short PR update and close items in **Alignment and remaining checks (merge)**.
+
+## Alignment and remaining checks (merge)
+
+Document the outcomes below in the PR (description or comments) before merge. They are **merge checks**, not a prerequisite for local work: tests, `nao test` runs, and analysis drafts can proceed in parallel.
+
+**Aligned decisions**
+
+- **Stage 1 scope** — proceed with SQL tests + manual `semantic_label`; no G-eval automation in this PR.
+- **Storage** — follow nao's existing local JSON output pattern (`results_*.json` in Docker volume / `tests/outputs` flow).
+- **Test case naming** — call Stage 1 YAML cases “SQL tests” for clarity.
+- **Test case format** — use existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) structure as Stage 1 baseline.
+- **Upstream nao issue** — native context/rubric eval support is a parallel Stage 2 input, not a Stage 1 blocker.
+
+**Remaining merge checks**
+
+- **Semantic rubric (Stage 1)** — confirm `semantic_label` values and concise rubric.
+  - **Proposed default:** `correct` = answer is accurate and complete; `partial` = correct conclusion but imprecise numbers or incomplete; `incorrect` = factual error or hallucination.
+- **Outputs/ignore rules** — ensure generated local outputs are not accidentally committed unless explicitly selected as a small analysis artifact.
+
+## Post-Stage 1 Calibration
+
+- acceptable cost per run
+- acceptable `execution_time`
+- numeric/order tolerance thresholds
+- priority among error categories (`hallucination` vs `wrong_filter` vs `sql_logic`)
+- top-priority eval scenarios for next iteration

From d1739a1e48a50e3dd9c62f9e0df4c14f403f3389 Mon Sep 17 00:00:00 2001
From: Anna Bekharskaia <anna.v.bekharskaya@gmail.com>
Date: Wed, 13 May 2026 19:49:41 +0300
Subject: [PATCH 02/38] docs(demo): updates for Stage 2

---
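Reviewer note (not part of the commit): the column rules this patch records for the mini analysis table (`status`, `semantic_label`, `failure_reason` if fail) could be checked mechanically before a CSV is committed. The following is a minimal sketch, assuming the CSV header uses exactly those column names. The function name `check_rows` and the sample rows are hypothetical and are not taken from `stage1_analysis.csv`.

```python
import csv
import io

# Allowed values as listed in the plan; everything else here is illustrative.
STATUSES = {"pass", "fail"}
SEMANTIC_LABELS = {"correct", "partial", "incorrect"}


def check_rows(csv_text: str) -> list[str]:
    """Return human-readable problems found in a mini analysis table."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        status = (row.get("status") or "").strip()
        label = (row.get("semantic_label") or "").strip()
        reason = (row.get("failure_reason") or "").strip()
        if status not in STATUSES:
            problems.append(f"line {line_no}: unknown status {status!r}")
        if label not in SEMANTIC_LABELS:
            problems.append(f"line {line_no}: unknown semantic_label {label!r}")
        if status == "fail" and not reason:
            problems.append(
                f"line {line_no}: failure_reason is required when status=fail"
            )
    return problems


# Invented sample rows, for illustration only.
SAMPLE = (
    "question,status,semantic_label,failure_reason\n"
    "How many ports are there?,pass,correct,\n"
    "What was uptime lately?,fail,incorrect,\n"
)

if __name__ == "__main__":
    for problem in check_rows(SAMPLE):
        print(problem)
```

Keeping `status` and `semantic_label` as separate checks mirrors the decision that they are independent layers: a row can pass the harness and still be labeled `partial`.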
 demo/plan_pr2.md | 195 +++++++++++++++++------------------------------
 1 file changed, 69 insertions(+), 126 deletions(-)

diff --git a/demo/plan_pr2.md b/demo/plan_pr2.md
index 671acb1..04254d3 100644
--- a/demo/plan_pr2.md
+++ b/demo/plan_pr2.md
@@ -1,167 +1,110 @@
-# Stage 1 Plan (MVP Baseline)
+# Chat evals plan — Stage 2 (stabilization)
 
-## Status — where we stand
-
-**Stage 1 merged in [`#86`](https://github.com/appspace/kwwhat/pull/86)** (2026-05-07). That PR is the **baseline we build from** for Stage 2+ on branch `chat-evals-stabilization`.
-
-**Delivered in #86**
-
-- **SQL eval tests** in `demo/chat-bi/tests/*.yml` — added cases alongside existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) (`decommissioned_ports_check`, `lately_snapshot`, `network_reliability_uptime`).
-- **Stage 1 analysis artifacts** — updated [`stage1_analysis.csv`](demo/chat-bi/tests/analysis/stage1_analysis.csv) and [`stage1_delta_summary.md`](demo/chat-bi/tests/analysis/stage1_delta_summary.md) for the expanded `nao test` run.
-- **Reproducible outputs** — Docker volume maps host `demo/chat-bi/tests/outputs/` ↔ container `/app/kwwhat/tests/outputs/` so `results_*.json` is reachable without `docker cp` (see [`demo/README.md`](demo/README.md)).
-- **Prompt wording** tightened on new tests so questions match what SQL assertions exercise.
-- **Local run noise** — ignore rules for generated artifacts where agreed (no accidental bulk commits of raw JSON).
-- **Approach locked for Stage 1** — `nao test` + SQL as **factual** reference; **manual** `semantic_label` / `failure_reason` in the mini analysis table for a lightweight semantic layer; **no** G-eval automation in that PR.
-
-**This document** — keeps the Stage 1 spec and merge alignment as **reference**, plus the **Stage 2 / Stage 3 placeholder** sections below. Edits here target the stabilization branch unless noted otherwise.
+**Branch:** `chat-evals-stabilization`
+**Status:** Stage 1 merged in [**#86**](https://github.com/appspace/kwwhat/pull/86) (2026-05-07) — that is the baseline we build from.
 
 ---
 
-**Goal:** Test whether the curated project context still drives correct chat behavior — not the LLM itself, not the chat product, not RAG. Catch regressions when [`agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`RULES.md`](demo/chat-bi/RULES.md), [`nao_config.yaml`](demo/chat-bi/nao_config.yaml), [`semantic_models.yml`](models/semantic/semantic_models.yml), or [`marts.yml`](models/marts/marts.yml) change.
-
-## Scope
-
-- Add 2–3 SQL test cases in `demo/chat-bi/tests/*.yml`
-- Run expanded `nao test`
-- Save outputs and build mini analysis table (`demo/chat-bi/tests/analysis/stage1_analysis.csv`)
-- Ensure outputs are saved via Docker volume (no manual copy)
-- Perform baseline vs expanded comparison (baseline = existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) run)
-- Send short update with findings
+## Stage 2 — current focus
 
-## Done Criteria
+### Goal
 
-- `>=3` test files total (including existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml))
-- 1 expanded run completed
-- `results_*.json` retained following nao's local JSON output pattern (reachable via Docker volume)
-- Docker volume confirmed working (outputs reachable on host without `docker cp`)
-- mini analysis table (`demo/chat-bi/tests/analysis/stage1_analysis.csv`) filled for all tests in run
-- mini analysis table includes `failure_reason` for failed cases
-- Stage 1 test case format agreed (baseline: existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) structure)
-- baseline metrics recorded (accuracy / cost / time)
-- delta metrics captured (baseline vs expanded)
-- short status note in the PR (or agreed team channel)
-- storage decision documented for this run
+Strengthen repeatable evals: golden dataset, rubrics, light automation, stable result format. No blocking on upstream nao.
 
-## Mini Analysis Table (Stage 1)
+### Scope
 
-Columns: `question`, `actual`, `expected` (optional), `status` (nao: `pass`/`fail`), `semantic_label` (`correct`/`partial`/`incorrect`), `failure_reason` (required if fail), `cost`, `execution_time`.
+- **Golden dataset** — single-turn; ~10–12 entries; fields: `question_id`, `user_input`, `reference_answer`, `reference_contexts` (or `source_refs`), optional `human_explanation`.
+  - `reference_contexts` = evidence pointers (`RULES.md`, `semantic_models.yml`, `marts.yml`) for traceability and review — **not** extra chunks injected into model prompts.
+- **G-Eval rubrics** — e.g. Terminology (no `session`; source: [`RULES.md`](demo/chat-bi/RULES.md)), Rate Format (%, pp), Metric Validity (only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml)), Completeness vs expected output — aim for ≥1 custom rubric.
+- **`results_summary.py`** — lightweight script to generate `stage1_analysis.csv` and `stage1_delta_summary.md` from `results_*.json`; first automation step in Stage 2.
+- **Standardize result fields** — `semantic_metric`, `semantic_score`, `semantic_threshold`, `semantic_pass`, `semantic_reason`, `judge_model`.
+- **Upstream nao direction** — native generic/context evals in `nao test` via upstream issue/contribution; **parallel track**, not a Stage 2 blocker; local prep proceeds independently.
+- **Optional CI** — add a gate once rubrics are stable and test flakiness is under control.
+- **Plan doc** — this file (`plan_pr2.md`) on branch `chat-evals-stabilization`.
 
-**Note:** `status` and `semantic_label` are independent layers — `status` is factual (nao harness), `semantic_label` is human review (rubric: see **Alignment and remaining checks (merge)**).
+### Visualization in Stage 2
 
-## Methodology Note
+Keep lightweight: tables and short markdown recaps in PRs. No standalone dashboard deliverable yet.
 
-- **`nao test`** — current **runtime** and **factual** baseline on the demo stack (DuckDB + chat-bi).
-- **Factual layer** — `nao test` + SQL reference (data correctness).
-- **Semantic layer (Stage 1)** — lightweight manual label (`semantic_label`) and `failure_reason`; no automated judge in this PR.
-- **Future direction** — native rubric/context eval support in `nao test` via upstream nao issue/contribution; not a Stage 1 blocker.
+### Post–Stage 1 calibration (carry into Stage 2 runs)
 
-## Stage 1 Boundary and Next Stages
+Tune with real runs and document outcomes in Stage 2 PRs:
 
-### Out of scope for Stage 1
+- Acceptable **cost** and **execution_time** per run.
+- Numeric / ordering **tolerance** thresholds.
+- **Priority** among error categories (`hallucination` vs `wrong_filter` vs `sql_logic`).
+- Top **eval scenarios** to grow first (e.g. partner support flows when ready).
 
-- Full integration of the **deepeval** package into the demo runtime or CI (unless explicitly approved later).
-- Making Answer Relevancy (or any deepeval metric) a required automated merge gate in Stage 1.
-- Sidecar/post-processing G-eval automation scripts for Stage 1.
-- Any automation that belongs to Stage 2+.
-
-### Stage 2+ direction (placeholder)
+---
 
-- **Stage 2 (example):**
+## Stage 3 — direction (branch `chat-evals-analytics`, after Stage 2 merge)
 
-  - **Upstream nao direction** — native generic/context evals in `nao test` via upstream issue/contribution; this is a parallel track, not a Stage 1 blocker.
-  - **Golden dataset** — single-turn; 10–12 entries; `question_id`, `user_input`, `reference_answer`, `reference_contexts` (or `source_refs`), optional `human_explanation`.
-  - **`reference_contexts` meaning** — evidence/source pointers (e.g. `RULES.md`, `semantic_models.yml`, `marts.yml`) used for traceability and review; **not** extra context injected into model prompts.
-  - **G-Eval rubrics** — Terminology (no `session`; source: [`RULES.md`](demo/chat-bi/RULES.md)), Rate Format (%, pp), Metric Validity (only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml)), Completeness vs expected output — aim for at least one custom metric.
-  - **Automation path** — prefer native `nao test` support over temporary post-processing scripts; local Stage 2 prep can proceed while upstream direction is clarified.
-  - **`results_summary.py`** — lightweight script to generate `stage1_analysis.csv` and `stage1_delta_summary.md` from `results_*.json`; proposed for early Stage 2 as first automation.
-  - **Standardize fields** — `semantic_metric`, `semantic_score`, `semantic_threshold`, `semantic_pass`, `semantic_reason`, `judge_model`.
-  - **Optional CI.**
-  - **Plan doc** — `plan_pr2.md` on branch `chat-evals-stabilization` (after Stage 1 merge).
+- **Streamlit viewer** — color-coded summary (questions × rubrics), drill-down per row (actual vs expected + reason per rubric), aggregate trends per rubric across runs.
+- **Chat replays (optional source)** — use nao chat replay storage as a read-only source of candidate eval cases; sanitize/anonymize before dataset inclusion.
+- **Long-term storage** of eval history.
+- **Stricter org-wide gates.**
+- **Plan doc** — `plan_pr3.md` on branch `chat-evals-analytics` (after Stage 2 merge).
 
-- **Stage 3 (example):**
+---
 
-  - **Streamlit viewer** — color-coded summary (questions × rubrics), drill-down per row (actual vs expected + reason per rubric), aggregate trends per rubric across runs.
-  - **Chat replays (optional source)** — use nao chat replay storage as a read-only source of candidate eval cases; sanitize/anonymize before dataset inclusion.
-  - **Long-term storage** of eval history.
-  - **Stricter org-wide gates.**
-  - **Plan doc** — `plan_pr3.md` on branch `chat-evals-analytics` (after Stage 2 merge).
+## Out of scope
 
-### Visualization notes (non-blocking)
+- **deepeval** full integration into demo runtime or CI as a **required** merge gate — unless explicitly approved.
+- **Sidecar/post-processing G-eval scripts** as a required Stage 2 gate before rubrics are stable.
+- **Standalone BI dashboard** (Power BI / Metabase on DuckDB) — duplicate effort; revisit only with clear value.
+- **Tableau Public** — optional later if portfolio case warrants it.
+- **MCP / external agent distribution** — separate parallel track; not part of this eval roadmap.
 
-Visualization is available as a future presentation layer, but it is **not required for this PR**.
+---
 
-- **Stage 1:** no separate dashboard deliverable; use run artifacts + mini analysis table (`demo/chat-bi/tests/analysis/stage1_analysis.csv`) in PR updates.
-- **Stage 2:** keep visualization lightweight if useful (tables/recaps) while the eval mechanics and result format stabilize.
-- **Stage 3:** primary visualization layer can be **Streamlit** for eval results (summary, drill-down, reasons, trends).
-- **Out of scope for now:** standalone BI dashboard track (Power BI/Metabase on DuckDB) to avoid duplicate effort.
-- **Optional later:** revisit Tableau Public only if there is clear value in extending existing WIP.
+## Stage 1 — completed reference
 
-## Test Design Constraints
+> Full execution checklists and PR alignment live in [**#86**](https://github.com/appspace/kwwhat/pull/86). This section is a compact record to orient Stage 2.
 
-**Rules:** [`demo/chat-bi/RULES.md`](demo/chat-bi/RULES.md), [`demo/chat-bi/agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`demo/chat-bi/nao_config.yaml`](demo/chat-bi/nao_config.yaml) — DuckDB only, no schema introspection, only `fact_*`/`dim_*` tables (`analytics.ANALYTICS.<table>`), default time window last 7 days.
+### Goal (achieved)
 
-**SQL:** reference `sql:` must use real columns from [`models/marts/marts.yml`](models/marts/marts.yml) and only metrics defined in [`models/semantic/semantic_models.yml`](models/semantic/semantic_models.yml).
+Test whether **curated project context** drives correct chat behavior — not the LLM itself, not the chat product, not RAG — and catch regressions when [`agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`RULES.md`](demo/chat-bi/RULES.md), [`nao_config.yaml`](demo/chat-bi/nao_config.yaml), [`semantic_models.yml`](models/semantic/semantic_models.yml), or [`marts.yml`](models/marts/marts.yml) change.
 
-**Narrative:** use `charge attempt`, `transaction`, `visit` — never `session`. Presentation rules (metrics at a glance, % format) apply to answer quality evaluation, not to the `sql:` block.
+### Delivered in #86
 
-**No extra context in tests:** do not inject additional context chunks into test prompts — tests rely only on the system context already configured in nao (`agent_instructions.md`, `RULES.md`, etc.). This is intentional: we test the curated context, not a retrieval layer.
+- **SQL eval tests** in `demo/chat-bi/tests/*.yml` — `decommissioned_ports_check`, `lately_snapshot`, `network_reliability_uptime` added alongside existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml).
+- **Analysis artifacts** — [`stage1_analysis.csv`](demo/chat-bi/tests/analysis/stage1_analysis.csv) and [`stage1_delta_summary.md`](demo/chat-bi/tests/analysis/stage1_delta_summary.md) filled for the expanded `nao test` run (2 passed, 2 failed — both failures: SQL binder error `Catalog "RAW" does not exist`, tracked as context/namespace drift signal).
+- **Reproducible outputs** — Docker volume: host `demo/chat-bi/tests/outputs/` ↔ `/app/kwwhat/tests/outputs/`; `results_*.json` reachable without `docker cp` (see [`demo/README.md`](demo/README.md)).
+- **Prompt wording** tightened to match SQL assertions.
+- **Ignore rules** — local run noise excluded from commits.
 
-## PR (Stage 1 only)
+### Approach locked for Stage 1 (carries forward as baseline convention)
 
-**PR title:**
+- `nao test` + SQL = **factual** reference layer.
+- Manual `semantic_label` / `failure_reason` = lightweight semantic layer; no automated G-eval judge in Stage 1.
+- YAML cases called **"SQL tests"**; baseline shape = [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) structure.
+- Storage: nao JSON pattern, Docker volume path above.
+- **Upstream native evals** = parallel Stage 2 input, not a Stage 1 blocker.
 
-`Stage 1 MVP: initial eval set + expanded baseline run`
+### Decisions locked with #86
 
-**PR should include:**
+- `semantic_label` values: `correct` = accurate and complete; `partial` = correct conclusion but imprecise or incomplete; `incorrect` = factual error or hallucination.
+- `status` and `semantic_label` are **independent layers** — harness pass/fail vs human rubric.
+- Do not bulk-commit raw `results_*.json`; keep selected analysis artifacts only.
 
-- new SQL tests in `demo/chat-bi/tests/*.yml`
-- expanded run outputs/artifacts (`results_*.json` from local nao output flow)
-- mini analysis table covering all tests from the expanded run
-- evaluation approach:
-  - SQL as factual reference layer
-  - LLM answer as semantic quality layer
-- link/reference to baseline run
-- delta summary: baseline vs expanded
-- short summary:
-  - pass/fail
-  - cost/time
-  - top 1–2 failure patterns
-  - next actions
-- storage outcome documented; artifact paths: container `/app/kwwhat/tests/outputs/`, repo/Docker volume `demo/chat-bi/tests/outputs/`
-- review note: `failure_reason` is filled for failed cases (`status=fail`)
+### Mini analysis table columns (Stage 1; extend for Stage 2)
 
-## Execution Order
+`question`, `actual`, `expected` (optional), `status`, `semantic_label`, `failure_reason` (if fail), `cost`, `execution_time` (+ `run_id`, `model`, etc. already used in CSV).
 
-1. Add tests (`demo/chat-bi/tests/*.yml`)
-2. Run expanded `nao test`
-3. Save/retain `results_*.json` following nao's local JSON output pattern
-4. Fill mini analysis table (include `question`, `actual`, optional `expected`, `status`, `semantic_label`, `failure_reason`, `cost`, `execution_time`)
-5. Capture baseline vs expanded delta (accuracy / cost / time) and include a short summary of top 1–2 failure patterns
-6. Commit implementation work using one PR/branch approach consistently for this repo
-7. Post a short PR update and close items in **Alignment and remaining checks (merge)**.
+---
 
-## Alignment and remaining checks (merge)
+## Test design constraints (all stages)
 
-Document the outcomes below in the PR (description or comments) before merge. They are **merge checks**, not a prerequisite for local work: tests, `nao test` runs, and analysis drafts can proceed in parallel.
+**Rules:** [`RULES.md`](demo/chat-bi/RULES.md), [`agent_instructions.md`](demo/chat-bi/agent_instructions.md), [`nao_config.yaml`](demo/chat-bi/nao_config.yaml) — DuckDB only, no schema introspection, only `fact_*`/`dim_*` tables (`analytics.ANALYTICS.<table>`), default time window last 7 days.
 
-**Aligned decisions**
+**SQL:** `sql:` must use real columns from [`marts.yml`](models/marts/marts.yml) and only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml).
 
-- **Stage 1 scope** — proceed with SQL tests + manual `semantic_label`; no G-eval automation in this PR.
-- **Storage** — follow nao's existing local JSON output pattern (`results_*.json` in Docker volume / `tests/outputs` flow).
-- **Test case naming** — call Stage 1 YAML cases “SQL tests” for clarity.
-- **Test case format** — use existing [`total_ports.yml`](demo/chat-bi/tests/total_ports.yml) structure as Stage 1 baseline.
-- **Upstream nao issue** — native context/rubric eval support is a parallel Stage 2 input, not a Stage 1 blocker.
+**Narrative:** `charge attempt`, `transaction`, `visit` — never `session`. Presentation rules (%, pp) apply to **answer quality** evaluation, not to the `sql:` block.
 
-**Remaining merge checks**
+**No extra context in test prompts** — no injected chunks; tests exercise curated nao context only (`agent_instructions.md`, `RULES.md`, etc.).
 
-- **Semantic rubric (Stage 1)** — confirm `semantic_label` values and concise rubric.
-  - **Proposed default:** `correct` = answer is accurate and complete; `partial` = correct conclusion but imprecise numbers or incomplete; `incorrect` = factual error or hallucination.
-- **Outputs/ignore rules** — ensure generated local outputs are not accidentally committed unless explicitly selected as a small analysis artifact.
+---
 
-## Post-Stage 1 Calibration
+## Scratch / course notes (ephemeral)
 
-- acceptable cost per run
-- acceptable `execution_time`
-- numeric/order tolerance thresholds
-- priority among error categories (`hallucination` vs `wrong_filter` vs `sql_logic`)
-- top-priority eval scenarios for next iteration
+_Bullets from evals course / syncs. Trim or delete before merging to `main`._

From faf58dd1ed3a5545df0324f9423fa964fa8c9972 Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Thu, 14 May 2026 11:27:25 -0400
Subject: [PATCH 03/38] entry format and prompts

Just thoughts; nothing is finalized. More like food for thought.
---
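Reviewer note (not part of the commit): the entry format and the `{{ ... }}` prompt template proposed in `demo/thoughts.md` could be exercised with a small helper before any judge is wired in. This is a minimal sketch under stated assumptions: the required-field set is a guess based on the plan's field list, the helper names `validate_entry` and `render_prompt` are hypothetical, and the placeholder syntax is handled with a plain regex rather than a real template engine.

```python
import re

# Assumption: these four fields are mandatory; human_explanation stays optional.
REQUIRED_FIELDS = ("question_id", "user_input", "reference_answer", "reference_contexts")
CATEGORIES = {"metric_validity", "terminology", "rate_format", "completeness"}
EVAL_TYPES = {"LLM-as-judge", "sql"}

# Matches placeholders such as {{ user_input }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    if entry.get("category") not in CATEGORIES:
        problems.append(f"unknown category: {entry.get('category')!r}")
    if entry.get("eval_type") not in EVAL_TYPES:
        problems.append(f"unknown eval_type: {entry.get('eval_type')!r}")
    for ref in entry.get("reference_contexts") or []:
        if not isinstance(ref, dict) or "file" not in ref:
            problems.append("reference_contexts items need a 'file' key")
    return problems


def render_prompt(template: str, values: dict) -> str:
    """Fill {{ name }} placeholders; raises KeyError if a value is missing."""
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


# The q009 example from the proposed format, written as a plain dict.
ENTRY = {
    "question_id": "q009",
    "category": "metric_validity",
    "eval_type": "LLM-as-judge",
    "user_input": "What is the charge point availability index?",
    "reference_answer": "The metric is not defined in the semantic model.",
    "reference_contexts": [
        {"file": "models/semantic/semantic_models.yml",
         "hint": "only metrics defined here are valid"},
    ],
    "human_explanation": "Model must decline and redirect.",
}

if __name__ == "__main__":
    print(validate_entry(ENTRY))
    print(render_prompt("Category: {{ category }}\n{{ human_explanation }}", ENTRY))
```

Failing loudly on a missing placeholder value is a deliberate choice in this sketch: a silently empty section in a judge prompt would be harder to notice than a `KeyError` at render time.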
 demo/thoughts.md | 106 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 demo/thoughts.md

diff --git a/demo/thoughts.md b/demo/thoughts.md
new file mode 100644
index 0000000..6643ad4
--- /dev/null
+++ b/demo/thoughts.md
@@ -0,0 +1,106 @@
+# Thoughts — golden dataset format
+
+## Proposed entry format
+
+```yaml
+- question_id: q009
+  category: metric_validity
+  eval_type: LLM-as-judge
+  user_input: "What is the charge point availability index?"
+  primary_context: models/semantic/semantic_models.yml
+  reference_answer: |
+    The metric "charge point availability index" is not defined in the semantic model.
+    The closest available metric is **uptime**. Would you like that instead?
+  reference_contexts:
+    - file: models/semantic/semantic_models.yml
+      hint: only metrics defined here are valid
+    - file: demo/chat-bi/RULES.md
+      hint: do not make up metrics
+  human_explanation: Model must decline and redirect — no hallucinated metrics.
+```
+
+## Field notes
+
+| Field | Purpose |
+|-------|---------|
+| `category` | groups entries by rubric (`metric_validity`, `terminology`, `rate_format`, `completeness`) |
+| `eval_type` | `LLM-as-judge` or `sql` — determines how the harness evaluates the answer |
+| `primary_context` | single file path for reviewer traceability — not injected into prompts |
+| `reference_contexts` | supporting evidence pointers for reviewer traceability; not injected into prompts |
+| `reference_answer` | what a correct answer looks like — format and intent, not exact match |
+| `human_explanation` | one line; captures the rubric intent for the judge prompt |
+
+## Open question
+
+Should `primary_context` point to the file that lets a reviewer **verify** the answer
+(e.g. `semantic_models.yml` to confirm the metric doesn't exist), or the file that
+states the **rule** being tested (e.g. `RULES.md` — "do not make up metrics")?
+
+---
+
+## Eval prompt
+
+Two parts: a **system prompt** (static, judge persona) and a **user prompt** (rendered per entry).
+
+### System prompt
+
+```
+You are an evaluator for a BI chat assistant that answers questions about EV charging networks.
+Your job is to score the assistant's response against the reference answer and criteria provided.
+Be strict and concise. Return only valid JSON.
+```
+
+### User prompt template
+
+```
+## User input
+{{ user_input }}
+
+## Reference answer
+{{ reference_answer }}
+
+## Evaluation criteria
+Category: {{ category }}
+{{ human_explanation }}
+
+## Actual response
+{{ actual_response }}
+
+**ANALYZE THE ACTUAL RESPONSE FOR THIS USER INPUT**
+
+Assess:
+
+1. **usefulness** ("high" / "medium" / "low"): How closely does the actual response align with the reference answer?
+   - "high": Matches the reference answer in intent, format, and key facts
+   - "medium": Partially aligned — correct intent but missing format or key details
+   - "low": Misaligned — wrong intent, wrong facts, or contradicts the reference answer
+
+2. **signal_pct** (0–100): What percentage of this actual response is RELEVANT to the reference answer?
+   - 80–100: Highly focused, almost all content maps to the reference answer
+   - 50–79: Majority relevant, some content not in the reference answer
+   - 20–49: Less than half maps to the reference answer
+   - 0–19: Almost no overlap with the reference answer
+
+3. **note**: Brief explanation (1 sentence)
+
+**BE STRICT:**
+- Do not give "high" if the response is missing key facts or format requirements from the reference answer
+- Do not round up signal_pct — penalize filler, hedging, or content not grounded in the reference answer
+- A polite but wrong answer is still "low"
+
+**OUTPUT FORMAT (JSON):**
+{
+  "question_id": "{{ question_id }}",
+  "category": "{{ category }}",
+  "summary": {
+    "usefulness": "high" | "medium" | "low",
+    "signal_pct": 0-100,
+    "note": "<one sentence>"
+  }
+}
+```
+
+### Notes
+
+- `primary_context` is a reviewer traceability pointer — not injected into the judge prompt.
+- The judge model goes into a `judge_model` field on the result record (not in the prompt itself).
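
Note (outside the diff): a minimal Python sketch of how one entry in the proposed format might be rendered into the user prompt template, and how the judge's JSON reply might be validated. It assumes an abbreviated copy of the template and plain regex substitution instead of a real template engine. `render_prompt` and `parse_judge_output` are hypothetical helper names, not code that exists in this repository.

```python
import json
import re

# Abbreviated copy of the user prompt template; placeholders use {{ name }}.
USER_PROMPT_TEMPLATE = """\
## User input
{{ user_input }}

## Reference answer
{{ reference_answer }}

## Evaluation criteria
Category: {{ category }}
{{ human_explanation }}

## Actual response
{{ actual_response }}
"""

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
ALLOWED_USEFULNESS = {"high", "medium", "low"}


def render_prompt(template: str, values: dict) -> str:
    """Fill {{ name }} placeholders; a missing field raises KeyError."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the summary block."""
    data = json.loads(raw)
    summary = data["summary"]
    if summary["usefulness"] not in ALLOWED_USEFULNESS:
        raise ValueError(f"unexpected usefulness: {summary['usefulness']!r}")
    pct = summary["signal_pct"]
    is_number = isinstance(pct, (int, float)) and not isinstance(pct, bool)
    if not is_number or not 0 <= pct <= 100:
        raise ValueError(f"signal_pct out of range: {pct!r}")
    return data


# Entry fields taken from the q009 example; actual_response is made up here.
entry = {
    "question_id": "q009",
    "category": "metric_validity",
    "user_input": "What is the charge point availability index?",
    "reference_answer": "The metric is not defined. The closest metric is uptime.",
    "human_explanation": "Model must decline and redirect.",
    "actual_response": "That metric is not defined; did you mean uptime?",
}
prompt = render_prompt(USER_PROMPT_TEMPLATE, entry)
```

Validating the reply before aggregation keeps a malformed judge answer from silently counting as a score.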

From 49a95d97981e1d320ce20228ac1f3dcb0c2219a7 Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Fri, 15 May 2026 11:52:35 -0400
Subject: [PATCH 04/38] Update thoughts.md

---
 demo/thoughts.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 6643ad4..7c4f694 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -30,6 +30,16 @@
 | `reference_answer` | what a correct answer looks like — format and intent, not exact match |
 | `human_explanation` | one line; captures the rubric intent for the judge prompt |
 
+## Categories to consider
+
+| Category | Question it answers |
+|----------|-------------------|
+| `metric_validity` | Did the model avoid inventing metrics not defined in the context? |
+| `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? |
+| `answer_relevance` | Does the answer address the user input? |
+| `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? |
+| `completeness` | Does the response cover everything the user input asked for? |
+
 ## Open question
 
 Should `primary_context` point to the file that lets a reviewer **verify** the answer

From 294887c04012ed6f645a98b9d17d5beb5ab6d8b7 Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Fri, 15 May 2026 11:58:46 -0400
Subject: [PATCH 05/38] criteria

criteria
---
 demo/thoughts.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 7c4f694..22a6d7e 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -32,8 +32,8 @@
 
 ## Categories to consider
 
-| Category | Question it answers |
-|----------|-------------------|
+| Category | Criteria |
+|----------|----------|
 | `metric_validity` | Did the model avoid inventing metrics not defined in the context? |
 | `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? |
 | `answer_relevance` | Does the answer address the user input? |
@@ -71,6 +71,7 @@ Be strict and concise. Return only valid JSON.
 
 ## Evaluation criteria
 Category: {{ category }}
+Criteria: {{ criteria }}
 {{ human_explanation }}
 
 ## Actual response

From d3cdb85b312b0760bb73e3e8125f1c140b31440d Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Fri, 15 May 2026 12:16:01 -0400
Subject: [PATCH 06/38] why it matters

---
 demo/thoughts.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 22a6d7e..be20d96 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -32,13 +32,13 @@
 
 ## Categories to consider
 
-| Category | Criteria |
-|----------|----------|
-| `metric_validity` | Did the model avoid inventing metrics not defined in the context? |
-| `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? |
-| `answer_relevance` | Does the answer address the user input? |
-| `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? |
-| `completeness` | Does the response cover everything the user input asked for? |
+| Category | Criteria | Why it matters |
+|----------|----------|----------------|
+| `metric_validity` | Did the model avoid inventing metrics not defined in the context? | prevents made-up KPIs from reaching dashboards and decisions |
+| `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? | prevents hallucinations, ensures traceable answers, builds user trust |
+| `answer_relevance` | Does the answer address the user input? | ensures the response is useful, not just grounded in context |
+| `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? | keeps domain language consistent across the product and reports |
+| `completeness` | Does the response cover everything the user input asked for? | avoids partial answers that require follow-up or cause misinterpretation |
 
 ## Open question
 

From 944fcf58bd504d5e1e726424e11973590ebcb763 Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Fri, 15 May 2026 12:26:38 -0400
Subject: [PATCH 07/38] where sql might fit in

---
 demo/thoughts.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index be20d96..98493c7 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -32,13 +32,14 @@
 
 ## Categories to consider
 
-| Category | Criteria | Why it matters |
-|----------|----------|----------------|
-| `metric_validity` | Did the model avoid inventing metrics not defined in the context? | prevents made-up KPIs from reaching dashboards and decisions |
-| `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? | prevents hallucinations, ensures traceable answers, builds user trust |
-| `answer_relevance` | Does the answer address the user input? | ensures the response is useful, not just grounded in context |
-| `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? | keeps domain language consistent across the product and reports |
-| `completeness` | Does the response cover everything the user input asked for? | avoids partial answers that require follow-up or cause misinterpretation |
+| Category | Criteria | Why it matters | Approach |
+|----------|----------|----------------|----------|
+| `sql_test` | Exact match against SQL assertion | catches regressions in factual outputs immediately | rule |
+| `metric_validity` | Did the model avoid inventing metrics not defined in the context? | prevents made-up KPIs from reaching dashboards and decisions | LLM-as-judge |
+| `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? | prevents hallucinations, ensures traceable answers, builds user trust | LLM-as-judge |
+| `answer_relevance` | Does the answer address the user input? | ensures the response is useful, not just grounded in context | LLM-as-judge |
+| `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? | keeps domain language consistent across the product and reports | LLM-as-judge |
+| `completeness` | Does the response cover everything the user input asked for? | avoids partial answers that require follow-up or cause misinterpretation | LLM-as-judge |
 
 ## Open question
 

From 97f78c2dac3094ad08aca9adb831b6a664ef38ec Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Fri, 22 May 2026 10:29:59 -0400
Subject: [PATCH 08/38] completeness wording

---
 demo/thoughts.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 98493c7..96c1f73 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -39,7 +39,7 @@
 | `faithfulness` | Does the answer stick to facts — no hallucination or unsupported claims? | prevents hallucinations, ensures traceable answers, builds user trust | LLM-as-judge |
 | `answer_relevance` | Does the answer address the user input? | ensures the response is useful, not just grounded in context | LLM-as-judge |
 | `terminology` | Did the model use correct vocabulary ("charge attempt" / "visit", never "session")? | keeps domain language consistent across the product and reports | LLM-as-judge |
-| `completeness` | Does the response cover everything the user input asked for? | avoids partial answers that require follow-up or cause misinterpretation | LLM-as-judge |
+| `completeness` | Does the response cover everything the user input asked for? Is it complete? Is it at the right level of detail? | avoids partial answers that require follow-up or cause misinterpretation | LLM-as-judge |
 
 ## Open question
 
@@ -116,3 +116,13 @@ Assess:
 
 - `primary_context` is a reviewer traceability pointer — not injected into the judge prompt.
 - The judge model goes into a `judge_model` field on the result record (not in the prompt itself).
+
+## Execution strategy
+
+**Prompt chaining** — break eval into stages where each stage feeds the next:
+
+1. **Retrieve** — fetch the actual response for a `question_id` from the chat log
+2. **Judge** — render the user prompt template and call the judge model; get back the JSON score
+3. **Aggregate** — collect scores across entries, compute pass rates per category
+
+Each stage is a discrete LLM call (or SQL step). The output of one becomes the input of the next. This keeps each prompt focused and makes failures easy to isolate.
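
Note (outside the diff): the three stages above could be wired together roughly as follows. This is a sketch under assumptions: the chat log is a plain dict keyed by `question_id`, `call_judge` is a hypothetical callable that takes a prompt and returns the judge's JSON text, and an entry counts as a pass only when `usefulness` is "high". The pass rule is not defined anywhere in these notes.

```python
import json
from collections import defaultdict
from typing import Callable


def retrieve(chat_log: dict, question_id: str) -> str:
    """Stage 1: look up the assistant's actual response for one entry."""
    return chat_log[question_id]


def judge(entry: dict, actual_response: str, call_judge: Callable[[str], str]) -> dict:
    """Stage 2: build the prompt, call the judge, parse its JSON reply."""
    prompt = (
        f"## User input\n{entry['user_input']}\n\n"
        f"## Reference answer\n{entry['reference_answer']}\n\n"
        f"## Actual response\n{actual_response}\n"
    )
    result = json.loads(call_judge(prompt))
    # Take identifiers from the entry rather than trusting the judge's echo.
    result["question_id"] = entry["question_id"]
    result["category"] = entry["category"]
    return result


def aggregate(scores: list, passing: frozenset = frozenset({"high"})) -> dict:
    """Stage 3: pass rate per category (assumed rule: usefulness in `passing`)."""
    totals, passed = defaultdict(int), defaultdict(int)
    for score in scores:
        totals[score["category"]] += 1
        if score["summary"]["usefulness"] in passing:
            passed[score["category"]] += 1
    return {category: passed[category] / totals[category] for category in totals}


def run(entries: list, chat_log: dict, call_judge: Callable[[str], str]) -> dict:
    """Chain the stages: each stage's output is the next stage's input."""
    scores = [
        judge(entry, retrieve(chat_log, entry["question_id"]), call_judge)
        for entry in entries
    ]
    return aggregate(scores)
```

Because `call_judge` is injected, the chain can be exercised with a stub before any real model is connected, which is one way to isolate failures per stage.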

From 03b0d7c4d0530d787fa3908d74a0df8dbabaac33 Mon Sep 17 00:00:00 2001
From: daria-sukhareva <daria@kwwhat.com>
Date: Fri, 22 May 2026 10:36:02 -0400
Subject: [PATCH 09/38] tradeoffs to consider

---
 demo/thoughts.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 96c1f73..a9ff44d 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -95,6 +95,9 @@ Assess:
 
 3. **note**: Brief explanation (1 sentence)
 
+**KEY TRADEOFFS TO CONSIDER:**
+- A short reply can have HIGHER signal than a long one — brevity is not a flaw if the answer is complete
+
 **BE STRICT:**
 - Do not give "high" if the response is missing key facts or format requirements from the reference answer
 - Do not round up signal_pct — penalize filler, hedging, or content not grounded in the reference answer

From 131609abc863eadb2142b515d2f26f8619dc7793 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 12 Jun 2026 11:43:30 -0400
Subject: [PATCH 10/38] Update thoughts.md

more thoughts
---
 demo/thoughts.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index a9ff44d..b8f2697 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -1,5 +1,33 @@
 # Thoughts — golden dataset format
 
+## What we're building
+
+We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this project. Here is what each term means:
+
+**LLM-as-a-judge** — instead of checking outputs with deterministic rules or exact string matches, a second LLM (the "judge") reads the assistant's response and scores it against the expected answer. This handles the inherent non-determinism of Chat BI.
+
+**Single-turn** — each eval entry is one self-contained exchange: one user input, one assistant response. There is no conversation history or multi-step context to manage. The assistant is evaluated on what it says in a single reply, making scoring deterministic and reproducible.
+
+**Reference-based** — every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in something concrete rather than asking it to reason from the rubric alone.
+
+**Local-first** — the eval harness runs entirely on the developer's machine, with no external eval platform or cloud service required. The golden dataset, judge prompts, scores, and results all live in this repository. This keeps the feedback loop fast, keeps data private, and means the eval is as easy to run as any other dbt command.
+
+**Same model as judge** — we plan to use the same model that powers Chat BI as the judge. This simplifies setup: no second API key, no model version management, no configuration drift. The trade-off is that models tend to score their own outputs more favorably — a known bias in self-evaluation. We accept this for now in exchange for simplicity, and can swap in a separate judge model later if scores prove unreliable.
+
+**Eval framework (deepeval or similar)** — rather than hand-rolling judge prompts, we intend to use an established eval framework. This means the nao Labs team does not need to become prompt engineering experts just to run evaluations — the framework owns the judge prompt design and scoring logic. It also provides a foundation that can be extended to multi-turn evals later without rebuilding from scratch.
+
+Stackable metrics to begin with:
+
+| Metric | What it checks |
+|--------|---------------|
+| Answer relevance | Does the response address what the user actually asked? |
+| Faithfulness | Is every claim grounded in the retrieved context — no hallucinations? |
+| Metric validity | Did the model avoid inventing metrics not defined in the semantic model? |
+| Terminology | Did the model use the correct domain vocabulary? |
+| Completeness | Did the response cover the full scope of the question at the right level of detail? |
+
+---
+
 ## Proposed entry format
 
 ```yaml

From 2438320cc7a28c007e5f2bd31a781d36487b2421 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 12 Jun 2026 11:52:45 -0400
Subject: [PATCH 11/38] Update thoughts.md

goal
---
 demo/thoughts.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index b8f2697..35b562c 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -2,6 +2,8 @@
 
 ## What we're building
 
+**Goal** — implement an evals framework that quantifies the impact of context changes on the nao Chat BI tool. We are not testing the LLM or general chat performance. We are testing one specific thing: did a change to the context — RULES.md, semantic model definitions, or similar input files — make the assistant's answers better or worse? The eval score is a signal for context quality, not model quality.
+
 We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this project. Here is what each term means:
 
 **LLM-as-a-judge** — instead of checking outputs with deterministic rules or exact string matches, a second LLM (the "judge") reads the assistant's response and scores it against the expected answer. This handles the inherent non-determinism of Chat BI.

From 8d12a27fc12e13acf04c9958ba3b71acdabe1fdf Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 12 Jun 2026 12:00:54 -0400
Subject: [PATCH 12/38] Update thoughts.md

wording
---
 demo/thoughts.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 35b562c..8ae6dfd 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -8,9 +8,9 @@ We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this pro
 
 **LLM-as-a-judge** — instead of checking outputs with deterministic rules or exact string matches, a second LLM (the "judge") reads the assistant's response and scores it against the expected answer. This handles the inherent non-determinism of Chat BI.
 
-**Single-turn** — each eval entry is one self-contained exchange: one user input, one assistant response. There is no conversation history or multi-step context to manage. The assistant is evaluated on what it says in a single reply, making scoring deterministic and reproducible.
+**Single-turn** — each eval entry is a single, atomic unit of interaction with the LLM app: one user input, one assistant response, no conversation history. The assistant is evaluated on what it says in that one reply, in isolation.
 
-**Reference-based** — every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in something concrete rather than asking it to reason from the rubric alone.
+**Reference-based** — every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in whether the agent arrived at the expected answer rather than asking it to reason from the rubric alone.
 
 **Local-first** — the eval harness runs entirely on the developer's machine, with no external eval platform or cloud service required. The golden dataset, judge prompts, scores, and results all live in this repository. This keeps the feedback loop fast, keeps data private, and means the eval is as easy to run as any other dbt command.
 

From eaab63165fd0583ec1fb3ef349a58065aeb210d9 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 12 Jun 2026 12:08:37 -0400
Subject: [PATCH 13/38] Update thoughts.md

compare eval framework options
---
 demo/thoughts.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 8ae6dfd..3c5692a 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -30,6 +30,20 @@ Stackable metrics to begin with:
 
 ---
 
+## Framework options
+
+| | DeepEval | Latitude | In-house |
+|--|----------|----------|----------|
+| **Approach** | Pre-built generic metric library | Evals derived from production failures and expert judgment | Hand-rolled judge prompts |
+| **Best for** | Pre-production unit testing — useful when there is no production traffic yet | Post-production — requires real failure data to work from | Full control, no dependencies |
+| **Main value** | No prompt engineering required; plug in metrics and run | Evals grounded in actual user failures, not generic rubrics | Fully tailored to the use case |
+| **Main risk** | Score quality depends on golden dataset quality — the framework is only as good as the reference answers | Not useful pre-production, by its own admission | Requires prompt engineering expertise; hard to maintain |
+| **Multi-turn path** | Built in | Built in | Rebuild from scratch |
+
+**Why this matters for nao:** we are targeting reference-based evals and assume end users arrive with a golden dataset. This changes the comparison significantly. The "generic metrics lie" concern — the main argument against DeepEval — is largely neutralised when every entry has a `reference_answer`: the judge is not scoring in the abstract, it is comparing against a concrete expected output. Latitude's value proposition (evals grounded in real production failures) does not apply here; a golden dataset replaces the need for production traffic. In-house also becomes more viable since comparing against a reference answer is a simpler judge prompt than scoring on abstract rubrics — but DeepEval still wins on setup cost and the multi-turn path for the nao Labs team.
+
+---
+
 ## Proposed entry format
 
 ```yaml

From 9cf513954c7fbad62f1c0a5eb97bd710962ab0a0 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 12 Jun 2026 12:25:54 -0400
Subject: [PATCH 14/38] Update thoughts.md

end-to-end
---
 demo/thoughts.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 3c5692a..0053c29 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -12,6 +12,8 @@ We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this pro
 
 **Reference-based** — every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in whether the agent arrived at the expected answer rather than asking it to reason from the rubric alone.
 
+**End-to-end** — we evaluate the observable input and output of the Chat BI system and treat it as a black box. We do not instrument internal steps — no retrieval spans, no tool call traces, no sub-agent scoring. We care about the result the user sees, not the path the system took to produce it. This is the right fit for context-change evals: if the answer improved, the context change worked, regardless of what happened inside.
+
 **Local-first** — the eval harness runs entirely on the developer's machine, with no external eval platform or cloud service required. The golden dataset, judge prompts, scores, and results all live in this repository. This keeps the feedback loop fast, keeps data private, and means the eval is as easy to run as any other dbt command.
 
 **Same model as judge** — we plan to use the same model that powers Chat BI as the judge. This simplifies setup: no second API key, no model version management, no configuration drift. The trade-off is that models tend to score their own outputs more favorably — a known bias in self-evaluation. We accept this for now in exchange for simplicity, and can swap in a separate judge model later if scores prove unreliable.

From 56bfc1b238c2770390973ed9b4345e64031e800e Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 12 Jun 2026 12:49:26 -0400
Subject: [PATCH 15/38] Update thoughts.md

metrics shortlist
---
 demo/thoughts.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 0053c29..19fb243 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -24,11 +24,9 @@ Stackable metrics to begin with:
 
 | Metric | What it checks |
 |--------|---------------|
-| Answer relevance | Does the response address what the user actually asked? |
 | Faithfulness | Is every claim grounded in the retrieved context — no hallucinations? |
-| Metric validity | Did the model avoid inventing metrics not defined in the semantic model? |
-| Terminology | Did the model use the correct domain vocabulary? |
 | Completeness | Did the response cover the full scope of the question at the right level of detail? |
+| Answer relevance | Does the response address what the user actually asked? |
 
 ---
 

From 87d47248c2c58c35b3914459342cde0f8f364c64 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Mon, 15 Jun 2026 10:46:39 -0400
Subject: [PATCH 16/38] Enhance goal description for evals framework

Expanded the goal description for the evals framework to include details about existing SQL tests and the addition of non-deterministic evals.
---
 demo/thoughts.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 19fb243..4aec1e6 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -2,7 +2,7 @@
 
 ## What we're building
 
-**Goal** — implement an evals framework that quantifies the impact of context changes on the nao Chat BI tool. We are not testing the LLM or general chat performance. We are testing one specific thing: did a change to the context — RULES.md, semantic model definitions, or similar input files — make the assistant's answers better or worse? The eval score is a signal for context quality, not model quality.
+**Goal** — implement an evals framework that quantifies the impact of context changes on the nao Chat BI tool. We are not testing the LLM or general chat performance. We are testing one specific thing: did a change to the context — RULES.md, semantic model definitions, or similar input files — make the assistant's answers better or worse? The eval score is a signal for context quality, not model quality. Nao already has SQL tests in place that guard against schema linking failures and semantic gaps. We are looking to add non-deterministic evals that catch failures even when the SQL and the resulting number are correct.
 
 We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this project. Here is what each term means:
 

From df3a1d0161f2a44315a7babf07796a02d5fb2ffa Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Mon, 15 Jun 2026 11:30:59 -0400
Subject: [PATCH 17/38] Document design choices for evals framework

Added section on design choices for evals framework.
---
 demo/thoughts.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 4aec1e6..4b55abc 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -4,6 +4,9 @@
 
 **Goal** — implement an evals framework that quantifies the impact of context changes on the nao Chat BI tool. We are not testing the LLM or general chat performance. We are testing one specific thing: did a change to the context — RULES.md, semantic model definitions, or similar input files — make the assistant's answers better or worse? The eval score is a signal for context quality, not model quality. Nao already has SQL tests in place that guard against schema linking failures and semantic gaps. We are looking to add non-deterministic evals that catch failures even when the SQL and the resulting number are correct.
 
+
+## Design choices
+
 We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this project. Here is what each term means:
 
 **LLM-as-a-judge** — instead of checking outputs with deterministic rules or exact string matches, a second LLM (the "judge") reads the assistant's response and scores it against the expected answer. This handles the inherent non-determinism of Chat BI.

From e8d6be6da617d7c6b4ff8f9594b7ddb45b463c3a Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Mon, 15 Jun 2026 11:55:04 -0400
Subject: [PATCH 18/38] Revise metrics and evaluation framework in thoughts.md

Updated metrics section to include RAG triad methodology and clarified evaluation criteria for curated context.
---
 demo/thoughts.md | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 4b55abc..c176c4c 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -23,14 +23,6 @@ We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this pro
 
 **Eval framework (deepeval or similar)** — rather than hand-rolling judge prompts, we intend to use an established eval framework. This means the nao Labs team does not need to become prompt engineering experts just to run evaluations — the framework owns the judge prompt design and scoring logic. It also provides a foundation that can be extended to multi-turn evals later without rebuilding from scratch.
 
-Stackable metrics to begin with:
-
-| Metric | What it checks |
-|--------|---------------|
-| Faithfulness | Is every claim grounded in the retrieved context — no hallucinations? |
-| Completeness | Did the response cover the full scope of the question at the right level of detail? |
-| Answer relevance | Does the response address what the user actually asked? |
-
 ---
 
 ## Framework options
@@ -47,6 +39,27 @@ Stackable metrics to begin with:
 
 ---
 
+# Metrics borrow from RAG triad
+
+RAG retrieves context, whereas nao's users curate it. We can account for this difference and still reuse the RAG metrics.
+
+The RAG triad tests the relationships between three entities (question, context, and response) using three metrics: 1. `Context Relevance`: is the retrieved context relevant to the question; 2. `Faithfulness`: is the answer grounded in the retrieved context; 3. `Answer Relevance`: is the answer relevant to what was asked. We can reuse this methodology by substituting curated context for retrieved context, so the question changes from "Was the right context retrieved?" to "Was the right context curated?"
+
+| Metric | Inputs | What it checks |
+|--------|--------|-------|
+| Context Relevance | input and curated_context | Was the context relevant to the question? |
+| Faithfulness | actual_output and curated_context | Is every claim grounded in the curated context — no hallucinations? |
+| Answer relevance | input and actual_output | Does the response address what the user actually asked? |
+| Completeness | input and actual_output | Did the response cover the full scope of the question at the right level of detail? |
+
+So the challenge will be to attach the curated context to each test at runtime. Other than that, the triad itself seems very close to what nao users might want to accomplish with their evals.
+
+
+
+
+## Do not use the rest
+---
+
 ## Proposed entry format
 
 ```yaml

From 680b7809ae6d0193ea006107e2fa74eb4846cd44 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Mon, 15 Jun 2026 12:01:49 -0400
Subject: [PATCH 19/38] Modify inputs for Faithfulness metric in thoughts.md

Updated the inputs for the Faithfulness metric to include 'input'.
---
 demo/thoughts.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
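
Note (outside the diff): the Inputs column of the metrics table, including the change in this patch, could be encoded as a small lookup so that a harness can check each test case carries what its metric needs. This is a sketch under assumptions: `METRIC_INPUTS`, `load_curated_context`, and `build_metric_inputs` are hypothetical names, and curated context is assumed to be the text of files read from disk at runtime.

```python
from pathlib import Path

# Mirrors the Inputs column of the metrics table (Faithfulness includes `input`).
METRIC_INPUTS = {
    "context_relevance": ("input", "curated_context"),
    "faithfulness": ("input", "actual_output", "curated_context"),
    "answer_relevance": ("input", "actual_output"),
    "completeness": ("input", "actual_output"),
}


def load_curated_context(paths: list, root: str = ".") -> list:
    """Read curated context files at runtime (assumed: plain UTF-8 text files)."""
    return [Path(root, path).read_text(encoding="utf-8") for path in paths]


def build_metric_inputs(metric: str, test_case: dict) -> dict:
    """Return only the fields a metric needs; fail loudly if one is missing or empty."""
    required = METRIC_INPUTS[metric]
    missing = [name for name in required if not test_case.get(name)]
    if missing:
        raise ValueError(f"{metric} is missing inputs: {', '.join(missing)}")
    return {name: test_case[name] for name in required}
```

Keeping the mapping in one place means a later change to a metric's inputs is a one-line edit rather than a change scattered through the harness.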

diff --git a/demo/thoughts.md b/demo/thoughts.md
index c176c4c..05d1588 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -48,7 +48,7 @@ RAG triad tests relationship b/w three entities: Question, Context and Response
 | Metric | Inputs | What it checks |
 |--------|--------|-------|
 | Context Relevance | input and curated_context | Was the context relevant to the question? |
-| Faithfulness | actual_output and curated_context | Is every claim grounded in the curated context — no hallucinations? |
+| Faithfulness | input, actual_output, and curated_context | Is every claim grounded in the curated context — no hallucinations? |
 | Answer relevance | input and actual_output | Does the response address what the user actually asked? |
 | Completeness | input and actual_output | Did the response cover the full scope of the question at the right level of detail? |
 

From 6bee62020cbcfc2fb611e5ab68d2b1bd4de4b065 Mon Sep 17 00:00:00 2001
From: Anna Bekharskaia <anna.v.bekharskaya@gmail.com>
Date: Mon, 22 Jun 2026 22:07:09 +0300
Subject: [PATCH 20/38] Document demo eval test constraints

---
 demo/plan_pr2.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/demo/plan_pr2.md b/demo/plan_pr2.md
index 04254d3..ef8d387 100644
--- a/demo/plan_pr2.md
+++ b/demo/plan_pr2.md
@@ -99,6 +99,10 @@ Test whether **curated project context** drives correct chat behavior — not th
 
 **SQL:** `sql:` must use real columns from [`marts.yml`](models/marts/marts.yml) and only metrics defined in [`semantic_models.yml`](models/semantic/semantic_models.yml).
 
+**Time windows:** if expected SQL has no date filter and aggregates over all available data, the prompt must explicitly ask for "full history"; otherwise the default last-7-days rule applies.
+
+**Reproducibility:** use `dbt run --full-refresh` for demo test setup when previous incremental state could affect eval outputs.
+
 **Narrative:** `charge attempt`, `transaction`, `visit` — never `session`. Presentation rules (%, pp) apply to **answer quality** evaluation, not to the `sql:` block.
 
 **No extra context in test prompts** — no injected chunks; tests exercise curated nao context only (`agent_instructions.md`, `RULES.md`, etc.).

From 8cd30938c70ac35170e150470c53dafc3005c707 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Wed, 24 Jun 2026 12:43:29 -0400
Subject: [PATCH 21/38] plan.MD

---
 .gitignore |   2 +-
 plan.MD    | 327 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 plan.MD

diff --git a/.gitignore b/.gitignore
index d5f8eb5..f5ffdee 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,7 +5,7 @@ dbt_packages/
 logs/
 
 # Python virtual environment
-venv/
+.venv/
 .dbt/
 
 # Local secrets — copy from .env.example and fill in
diff --git a/plan.MD b/plan.MD
new file mode 100644
index 0000000..28c9c7f
--- /dev/null
+++ b/plan.MD
@@ -0,0 +1,327 @@
+# Plan: LLM-as-Judge RAG Triad Evals for nao
+
+## Background
+
+nao's agent answers questions by reading file-based project context (schema docs, `columns.md`,
+`preview.md`) and executing SQL. When a user asks a question, the agent:
+
+1. Reads relevant files from the project folder (`read_file`, `grep`, `list` tools)
+2. Executes SQL queries (`execute_sql` tool)
+3. Composes a natural language response from those tool outputs
+
+The **curated context** for a faithfulness eval is exactly those tool outputs — the files the agent
+read and the SQL results it used to compose its answer.
+
+---
+
+## Two Approaches to Curated Context
+
+### Option A — Static: bake context into the golden dataset
+
+Pre-record the tool outputs when building the dataset. Each golden record carries:
+
+```json
+{
+  "id": "q001",
+  "input": "What were total sales by region last quarter?",
+  "expected_output": "...",
+  "curated_context": [
+    "[File: databases/.../columns.md]\n# users table...",
+    "[SQL result]\nColumns: region, total\n| North | 1200000 |..."
+  ]
+}
+```
+
+The eval test reads context directly from the file — no running backend needed.
+
+**Downside:** Context goes stale the moment the project folder, schema docs, or SQL results change.
+If nao's prompt is updated and the agent now reads different files or generates different SQL, the
+golden context no longer reflects what the model actually used. You end up judging faithfulness
+against a context that wasn't actually provided to the LLM — which defeats the metric entirely.
+Every dataset update becomes a manual two-step: run the agent, copy the tool outputs, paste them
+into the JSONL. This does not scale past a handful of records.
+
+---
+
+### Option B — Dynamic: capture context at eval runtime (chosen approach)
+
+At eval time, call a dedicated nao endpoint that runs the agent and returns **both** the final
+answer and the exact tool results that produced it, in a single response. The eval harness uses
+those tool results directly as `retrieval_context` for DeepEval. No context is stored in the
+dataset.
+
+**Why this is correct:**
+
+- The context passed to DeepEval is always the context that was actually provided to the LLM for
+  *that specific run*. There is no staleness.
+- If nao's prompt changes, the agent reads different files, or the database is updated, the eval
+  automatically reflects the new reality — it measures faithfulness of what nao actually said
+  against what nao actually saw.
+- The golden dataset stays lean: only `input` and `expected_output`. No manual context curation.
+- It also enables a future regression signal: if faithfulness drops after a prompt or tool change,
+  you know the model started making claims not grounded in its own context.
+
+The only requirement is a running nao backend, which is already needed to get `actual_output`
+anyway. Option B adds zero extra infrastructure.
+
+---
+
+## Architecture
+
+### What exists today
+
+`AgentManager` in `apps/backend/src/services/agent.ts` has two execution modes:
+
+- `stream()` — used by the main `/api/agent` route; returns a UI message stream to the browser
+- `generate()` — non-streaming; returns `AgentRunResult` which already contains:
+  ```typescript
+  steps: ReadonlyArray<{
+    toolCalls: ReadonlyArray<{ toolName: string; toolCallId: string; input: unknown }>;
+    toolResults: ReadonlyArray<{ toolCallId: string; output?: unknown }>;
+  }>
+  ```
+
+`steps[*].toolResults` is the exact grounding context. It is already collected internally — it just
+has no HTTP endpoint that exposes it.
+
+---
+
+## What Changes in the Backend (minimal)
+
+Add one new Fastify route: `POST /api/evals/chat`.
+
+It mirrors `/api/agent` but:
+
+1. Calls `agent.generate()` instead of `agent.stream()` — blocks until the agent finishes, returns
+   `AgentRunResult`
+2. Returns JSON `{ text, model_id, tool_results }` instead of a UI message stream
+3. `tool_results` = `steps.flatMap(s => s.toolResults)` filtered to context-bearing tools:
+   `execute_sql`, `read`, `grep`, `list`, `search`
+4. Accepts an optional `model: { provider, modelId }` body field to override the project default
+5. Protected by the same `authMiddleware` as every other route — no additional gate needed
+
+No changes to existing routes or agent internals. `generate()` is already exercised by automations.
+
+**Response shape:**
+```json
+{
+  "text": "Total sales by region last quarter were: North $1.2M...",
+  "model_id": "claude-sonnet-4-6",
+  "tool_results": [
+    { "toolName": "read", "output": { "content": "# sales table\n..." } },
+    { "toolName": "execute_sql", "output": { "columns": ["region", "total"], "data": [...] } }
+  ]
+}
+```
+
+`model_id` reflects whichever model ran — either the project default from Settings → Project →
+Models, or the override passed in the request body.
+
+---
+
+## Directory Layout
+
+Evals follow the same project-relative convention as `nao test`. The framework lives in the
+`nao_core` package; the user's golden dataset lives in their project folder alongside other test
+assets.
+
+```
+# nao_core package (framework — shipped with nao)
+cli/nao_core/evals/
+  conftest.py              # ALL shared fixtures
+  test_evals_triad.py      # all metrics in one file
+
+# User's project (data — owned by the team)
+{project}/tests/evals/
+  golden_dataset.jsonl     # {id, input, expected_output} — no context field
+```
+
+This mirrors how `nao test` works: `nao_core` contains the runner logic; the user's `tests/` folder
+contains the test cases. Running `nao evals` from the project directory discovers
+`tests/evals/golden_dataset.jsonl` via `Path.cwd()`, exactly as `nao test` discovers
+`tests/*.yml`.
+
+**One file, one test function, all metrics together.** DeepEval's `assert_test(test_case,
+metrics=[m1, m2, ...])` runs all metrics concurrently on the same test case. This means the
+`/api/evals/chat` API call happens once per record regardless of how many metrics are active.
+Separate files per metric would repeat that call once per metric per record — 4× the cost once all
+four metrics are in place.
+
+The file is named `test_evals_triad.py` after the established RAG triad (faithfulness, context
+relevance, answer relevance), with completeness as a fourth extension. The context here is not
+retrieved but captured from the agent's own tool outputs at runtime — the triad is re-purposed for
+curated context rather than retrieval.
+
+Adding a metric later = one new fixture in `conftest.py` + one new entry in the `metrics=[]` list.
+Nothing else changes.
+
+---
+
+## Golden Dataset Schema
+
+```jsonl
+{"id": "q001", "input": "What were total sales by region last quarter?", "expected_output": "..."}
+{"id": "q002", "input": "Which customers churned in May?", "expected_output": "..."}
+```
+
+No `curated_context` field. Context is captured live.
+
+---
+
+## Eval Harness Flow (per test case)
+
+```
+for each record in golden_dataset.jsonl:
+  1. POST /api/evals/chat  { input: record.input, model?: { provider, modelId } }
+  2. → { text: <actual_output>, model_id: <model>, tool_results: [...] }
+  3. curated_context = [serialize(tr) for tr in tool_results]
+  4. LLMTestCase(
+       input             = record.input,
+       actual_output     = response.text,
+       retrieval_context = curated_context,
+       expected_output   = record.expected_output   # used by completeness later
+     )
+  5. assert_test(test_case, metrics=[
+       faithfulness_metric,
+       context_relevance_metric,
+       answer_relevance_metric,
+     ])
+```
+
+All three metrics run concurrently on the same test case. DeepEval reports each score and reason
+independently — a failure in one does not suppress the others.
+
+**Tool result serialization** (pure Python in `conftest.py`, not backend logic):
+
+| Tool | Serialized as |
+|---|---|
+| `read` | `"[File: {path}]\n{content}"` |
+| `execute_sql` | `"[SQL result]\nColumns: {cols}\n{rows as markdown table}"` |
+| `grep` | `"[Search: {pattern}]\n{matches}"` |
+| `list` | `"[Directory: {path}]\n{entries}"` |
+
+---
+
+## DeepEval Metric Configuration
+
+The judge model is read from the `/api/evals/chat` response `model_id` field — whatever ran the
+agent. `conftest.py` instantiates the judge from that model ID:
+
+- **Claude models** (`model_id.startswith("claude")`) — wrapped in a custom `_ClaudeJudge`
+  subclass of `DeepEvalBaseLLM` using the `anthropic` SDK directly
+- **OpenAI and other providers** — passed as a plain string; DeepEval handles them natively
+
+```python
+FaithfulnessMetric(threshold=0.7, model=judge, include_reason=True)
+ContextualRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
+AnswerRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
+```
+
+Thresholds start at 0.7 across the board — tighten per metric as the dataset matures and baselines
+become clear.
+
+Using the same model as judge introduces self-serving bias — a model tends to score its own outputs
+higher. To use an independent judge, pass `-m` with a different model than the one configured in
+project settings. Note the limit of this: the agent runs with that model and the judge is built from
+the same response `model_id`, so the two still match; an independent judge would need its own setting.
+
+---
+
+## `conftest.py` Fixtures
+
+- **`nao_client`** — `httpx.Client` pointed at `http://localhost:5005` (or `NAO_EVAL_URL` env
+  var). Session-scoped. Cookie set from `NAO_AUTH_COOKIE` env var (injected by `nao evals` runner).
+- **`pytest_generate_tests`** — parametrizes over records in `golden_dataset.jsonl`. Dataset path
+  comes from `NAO_EVALS_DIR` env var (set by runner to `{cwd}/tests/evals`).
+- **`eval_response`** — calls `POST /api/evals/chat` for each record. Passes `model` body field if
+  `NAO_EVAL_MODEL` is set.
+- **`judge_model`** — builds a `_ClaudeJudge` instance for Claude models; returns plain string for
+  OpenAI (DeepEval native).
+- **`llm_test_case`** — serializes tool results into `retrieval_context`, returns `LLMTestCase`.
+- **`faithfulness_metric`**, **`context_relevance_metric`**, **`answer_relevance_metric`** — metric
+  instances using the judge model.
+- **`completeness_metric`** — add later; `GEval` with custom rubric (see below).
+
+---
+
+## Running the Evals
+
+```bash
+# Start nao backend normally
+npm run dev -w @nao/backend
+
+# Run all evals (from your project directory)
+nao evals
+
+# Specify the agent model (mirrors nao test -m)
+nao evals -m anthropic:claude-sonnet-4-6
+
+# Verbose output
+nao evals -v
+
+# Run a single record by ID
+nao evals -s q001
+
+# Explicit credentials
+nao evals -u user@example.com --password secret
+```
+
+Install eval deps once with `uv sync --extra evals` from `cli/`. The `evals` extra adds only
+`deepeval` — provider SDKs come from the user's existing nao installation.
+
+Add `make evals` to the root `Makefile` for consistency with `make lint`.
+
+---
+
+## CI
+
+Do not block PRs on evals — they are slow (~5–10 s per LLM judge call) and cost money.
+
+Add `.github/workflows/evals.yml`:
+- Trigger: `workflow_dispatch` (manual) + `schedule: cron: '0 6 * * 1'` (weekly, Monday 06:00 UTC)
+- Spins up nao with a test project pointed at a fixture database
+- Runs the eval suite
+- Publishes scores as a job summary table (record ID, score, reason for any failures)
+
+This gives a weekly regression signal without gating every PR.
+
+---
+
+## Implementation Sequence
+
+1. **Add `POST /api/evals/chat` Fastify route** — ~40 lines in `apps/backend/src/routes/evals.ts`;
+   calls existing `agent.generate()`, returns `{ text, model_id, tool_results }`, accepts optional
+   `model` override
+2. **Register route** in `apps/backend/src/app.ts` under `/api/evals`
+3. **Write `cli/nao_core/evals/conftest.py`** — fixtures, judge model factory, tool result
+   serializer, dataset loader via `NAO_EVALS_DIR`
+4. **Write `cli/nao_core/evals/test_evals_triad.py`** — single parametrized test,
+   `assert_test(test_case, metrics=[...])`
+5. **Write `cli/nao_core/commands/evals/runner.py`** — `nao evals` command mirroring `nao test`;
+   `-m`, `-u`, `--password`, `-v`, `-s` flags; authenticates via `get_auth_session`, sets env vars,
+   invokes pytest
+6. **Register command** in `cli/nao_core/commands/__init__.py` and `cli/nao_core/main.py`
+7. **Add `evals` extra** to `cli/pyproject.toml` — `deepeval>=2.0` only
+8. **Populate `{project}/tests/evals/golden_dataset.jsonl`** — Q&A pairs from golden dataset
+9. **Wire CI** — add `.github/workflows/evals.yml`
+
+---
+
+## Adding Completeness Later
+
+The full triad ships from day one. Completeness is the only metric deferred — it has no
+first-class DeepEval class yet. When ready, adding it is two steps:
+
+1. Add a `completeness_metric` fixture in `conftest.py` using `GEval` with a rubric such as
+   "does the answer fully address all parts of the question, given the expected output?"
+2. Add it to the `metrics=[]` list in `test_evals_triad.py`
+
+No new files, no API changes, no dataset changes — `expected_output` is already in every
+`LLMTestCase` from day one.
+
+| Metric | DeepEval class | Ships | Fields used |
+|---|---|---|---|
+| Faithfulness | `FaithfulnessMetric` | Day 1 | `input`, `actual_output`, `retrieval_context` |
+| Context relevance | `ContextualRelevancyMetric` | Day 1 | `input`, `retrieval_context` |
+| Answer relevance | `AnswerRelevancyMetric` | Day 1 | `input`, `actual_output` |
+| Completeness | `GEval` (custom rubric) | Later | `input`, `actual_output`, `expected_output` |
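
The tool result serialization table in plan.MD can be sketched in plain Python. This is a minimal illustration under assumptions, not the `conftest.py` code: the `toolName` and `output` keys follow the response shape shown in the plan, but the `path`, `pattern`, and `matches` keys and the row layout of `data` are guesses, and `serialize` and `build_context` are hypothetical names. Which tools count as context-bearing is passed in as a parameter because that choice is a tunable part of the plan.

```python
import json


def serialize(tool_result: dict) -> str:
    """Turn one tool result into a text chunk for `retrieval_context`."""
    name = tool_result["toolName"]
    output = tool_result.get("output") or {}
    if name == "execute_sql":
        cols = output.get("columns", [])
        # Assumption: each row is a list ordered like `columns`.
        rows = "\n".join(
            "| " + " | ".join(str(v) for v in row) + " |"
            for row in output.get("data", [])
        )
        return f"[SQL result]\nColumns: {', '.join(cols)}\n{rows}"
    if name == "read":
        # Assumption: the output carries a `path` next to `content`.
        return f"[File: {output.get('path', '?')}]\n{output.get('content', '')}"
    if name == "grep":
        # Assumption: `pattern` and `matches` keys exist on grep output.
        return f"[Search: {output.get('pattern', '?')}]\n{output.get('matches', '')}"
    # Fallback for shapes not covered above: dump the raw output.
    return f"[{name}]\n{json.dumps(output)}"


def build_context(tool_results: list[dict], include: set[str]) -> list[str]:
    """Keep only the listed tools and serialize them in call order."""
    return [serialize(tr) for tr in tool_results if tr["toolName"] in include]


context = build_context(
    [
        {"toolName": "list", "output": {"entries": ["models/"]}},
        {"toolName": "read", "output": {"path": "columns.md", "content": "# sales table"}},
        {
            "toolName": "execute_sql",
            "output": {"columns": ["region", "total"], "data": [["North", 1200000]]},
        },
    ],
    include={"execute_sql", "read", "grep"},
)
```

Keeping the filter on the Python side as well as in the route means the set of included tools can be narrowed during calibration without a backend change.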

From d26cd75d7aa99b2d6d72da9e857a20297e3a4e09 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 10:37:18 -0400
Subject: [PATCH 22/38] Update plan.MD

what we learned from running evals: exclude `list` from the context-fetching tools; lower the Contextual Relevancy threshold to account for broader curated context
---
 plan.MD | 49 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/plan.MD b/plan.MD
index 28c9c7f..ecb630b 100644
--- a/plan.MD
+++ b/plan.MD
@@ -5,12 +5,13 @@
 nao's agent answers questions by reading file-based project context (schema docs, `columns.md`,
 `preview.md`) and executing SQL. When a user asks a question, the agent:
 
-1. Reads relevant files from the project folder (`read_file`, `grep`, `list` tools)
+1. Reads relevant files from the project folder (`read`, `grep`, `search`, `list` tools)
 2. Executes SQL queries (`execute_sql` tool)
 3. Composes a natural language response from those tool outputs
 
-The **curated context** for a faithfulness eval is exactly those tool outputs — the files the agent
-read and the SQL results it used to compose its answer.
+The **curated context** for a faithfulness eval is the subset of tool outputs that carry actual
+content: `execute_sql`, `read`, `grep`, and `search`. The `list` tool is excluded — it returns
+directory metadata (path/dir/size), not file content, so it adds no semantic signal to the eval.
 
 ---
 
@@ -95,8 +96,10 @@ It mirrors `/api/agent` but:
 1. Calls `agent.generate()` instead of `agent.stream()` — blocks until the agent finishes, returns
    `AgentRunResult`
 2. Returns JSON `{ text, model_id, tool_results }` instead of a UI message stream
-3. `tool_results` = `steps.flatMap(s => s.toolResults)` filtered to context-bearing tools:
-   `execute_sql`, `read`, `grep`, `list`, `search`
+3. `tool_results` = `steps.flatMap(s => s.toolResults)` filtered to content-bearing tools:
+   `execute_sql`, `read`, `grep`, `search`
+   — `list` is intentionally excluded: it returns directory metadata (path/dir/size), not file
+   content, so it adds no semantic signal and dilutes Contextual Relevancy scores
 4. Accepts an optional `model: { provider, modelId }` body field to override the project default
 5. Protected by the same `authMiddleware` as every other route — no additional gate needed
 
@@ -193,12 +196,16 @@ independently — a failure in one does not suppress the others.
 
 **Tool result serialization** (pure Python in `conftest.py`, not backend logic):
 
-| Tool | Serialized as |
-|---|---|
-| `read` | `"[File: {path}]\n{content}"` |
-| `execute_sql` | `"[SQL result]\nColumns: {cols}\n{rows as markdown table}"` |
-| `grep` | `"[Search: {pattern}]\n{matches}"` |
-| `list` | `"[Directory: {path}]\n{entries}"` |
+| Tool | Serialized as | Why included |
+|---|---|---|
+| `execute_sql` | `"[SQL result]\nColumns: {cols}\n{rows as markdown table}"` | The primary grounding — actual query results the agent used |
+| `read` | `"[File: {path}]\n{content}"` | Schema docs, column descriptions, semantic models the agent read |
+| `grep` | `"[grep: {pattern}]\n{matches}"` | File content snippets matched during search |
+| `search` | `"[search]\n{raw output}"` | Semantic search results over project context |
+
+`list` is excluded — it returns directory metadata only (path/dir/size), never file content, and
+consistently dilutes Contextual Relevancy by adding entries with no semantic relationship to the
+question.
 
 ---
 
@@ -217,8 +224,24 @@ ContextualRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
 AnswerRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
 ```
 
-Thresholds start at 0.7 across the board — tighten per metric as the dataset matures and baselines
-become clear.
+**Contextual Relevancy threshold is intentionally lower than the other two metrics.** In a classic
+RAG pipeline, retrieval is scoped tightly to the question — only the most relevant chunks are
+fetched. A chat agent works differently: it reads schema docs, semantic model definitions, and
+column descriptions to build a broad understanding of the data model before it can compose a
+correct answer. Much of that context is genuinely necessary for the agent to reason well, but it is
+not directly about the specific question. The result is that Contextual Relevancy is structurally
+diluted compared to a purpose-built retrieval system — by design, not by failure.
+
+This means a score of 0.5–0.65 on Contextual Relevancy may reflect a well-functioning agent
+rather than a broken retrieval step. The practical threshold to start with:
+
+```python
+FaithfulnessMetric(threshold=0.7, ...)          # grounding: strict
+ContextualRelevancyMetric(threshold=0.5, ...)   # broad context: looser
+AnswerRelevancyMetric(threshold=0.7, ...)       # answer quality: strict
+```
+
+Calibrate from observed baselines as the dataset grows, not from RAG benchmarks.
 
 Using the same model as judge introduces self-serving bias — a model tends to score its own outputs
 higher. To use an independent judge, pass `-m` with a different model than the one configured in

From ad39fcc5feb642ddbf8cae128921bf17e8fc63b1 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 11:42:58 -0400
Subject: [PATCH 23/38] Update plan.MD

added example output
---
 plan.MD | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/plan.MD b/plan.MD
index ecb630b..d926fa7 100644
--- a/plan.MD
+++ b/plan.MD
@@ -294,6 +294,59 @@ Install eval deps once with `uv sync --extra evals` from `cli/`. The `evals` ext
 
 Add `make evals` to the root `Makefile` for consistency with `make lint`.
 
+Results are written to `{project}/tests/outputs/evals_{timestamp}.json`:
+
+```json
+{
+  "timestamp": "2026-06-25T11:32:24.385414",
+  "results": [
+    {
+      "id": "q001",
+      "input": "How many ports are currently decommissioned?",
+      "passed": false,
+      "metrics": [
+        {
+          "name": "Faithfulness",
+          "score": 0.3333,
+          "threshold": 0.7,
+          "passed": false,
+          "reason": "The score is 0.33 because the actual output contradicts the retrieval context in multiple ways regarding the decommissioned ports data..."
+        },
+        {
+          "name": "Contextual Relevancy",
+          "score": 0.2667,
+          "threshold": 0.5,
+          "passed": false,
+          "reason": "The score is 0.27 because while there are some tangentially relevant elements such as a file named 'decommissioned_ports_check.yml'..."
+        },
+        {
+          "name": "Answer Relevancy",
+          "score": 1.0,
+          "threshold": 0.7,
+          "passed": true,
+          "reason": "The score is 1.00 because the response directly and completely addresses the question about the number of currently decommissioned ports. Great job!"
+        }
+      ]
+    },
+    {
+      "id": "q002",
+      "input": "What is the overall uptime percentage of my EV charging network for the full history?",
+      "passed": false,
+      "metrics": [
+        { "name": "Faithfulness", "score": 1.0, "threshold": 0.7, "passed": true, "reason": "..." },
+        { "name": "Contextual Relevancy", "score": 0.4737, "threshold": 0.5, "passed": false, "reason": "The score is 0.47 because while some retrieved context is highly relevant — including the 'uptime' semantic model — a significant portion covers unrelated models like 'charge_attempts', 'fact_visits', 'dim_dates'..." },
+        { "name": "Answer Relevancy", "score": 1.0, "threshold": 0.7, "passed": true, "reason": "..." }
+      ]
+    }
+  ]
+}
+```
+
+Both records fail on Contextual Relevancy — consistent with the structural dilution described above.
+Answer Relevancy is perfect (1.0) on both: the agent answers the question correctly even when it
+fetches broader context than strictly needed. Faithfulness failure on q001 indicates the agent made
+claims not grounded in what it actually retrieved — a signal worth investigating.
+
 ---
 
 ## CI
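
The results file added in this patch has a regular shape (`results[*].metrics[*]` with `name`, `score`, `threshold`, `passed`), so failed metrics can be pulled out with a few lines of Python. This is a minimal sketch that assumes only the fields visible in the example output; `failing_metrics` is a hypothetical helper, not part of `nao evals`.

```python
import json


def failing_metrics(report: dict) -> list[tuple[str, str, float, float]]:
    """Return (record id, metric name, score, threshold) for each failed metric."""
    failures = []
    for result in report["results"]:
        for metric in result["metrics"]:
            if not metric["passed"]:
                failures.append(
                    (result["id"], metric["name"], metric["score"], metric["threshold"])
                )
    return failures


# Trimmed copy of the example output (reasons omitted).
report = json.loads("""
{
  "results": [
    {"id": "q001", "passed": false, "metrics": [
      {"name": "Faithfulness", "score": 0.3333, "threshold": 0.7, "passed": false},
      {"name": "Contextual Relevancy", "score": 0.2667, "threshold": 0.5, "passed": false},
      {"name": "Answer Relevancy", "score": 1.0, "threshold": 0.7, "passed": true}
    ]},
    {"id": "q002", "passed": false, "metrics": [
      {"name": "Faithfulness", "score": 1.0, "threshold": 0.7, "passed": true},
      {"name": "Contextual Relevancy", "score": 0.4737, "threshold": 0.5, "passed": false},
      {"name": "Answer Relevancy", "score": 1.0, "threshold": 0.7, "passed": true}
    ]}
  ]
}
""")

failures = failing_metrics(report)
for record_id, name, score, threshold in failures:
    print(f"{record_id}  {name}: {score:.2f} < {threshold}")
```

A flat list of failures like this is also a convenient input for the job summary table mentioned in the CI section.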

From 58ee3a3bf77674c44b9c394a1aefca80fad46c4a Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 21:38:30 -0400
Subject: [PATCH 24/38] Update plan.MD

removed search + fixed tests
---
 plan.MD | 74 ++++++++++++++++++++++++---------------------------------
 1 file changed, 31 insertions(+), 43 deletions(-)

diff --git a/plan.MD b/plan.MD
index d926fa7..f940d86 100644
--- a/plan.MD
+++ b/plan.MD
@@ -97,9 +97,11 @@ It mirrors `/api/agent` but:
    `AgentRunResult`
 2. Returns JSON `{ text, model_id, tool_results }` instead of a UI message stream
 3. `tool_results` = `steps.flatMap(s => s.toolResults)` filtered to content-bearing tools:
-   `execute_sql`, `read`, `grep`, `search`
-   — `list` is intentionally excluded: it returns directory metadata (path/dir/size), not file
-   content, so it adds no semantic signal and dilutes Contextual Relevancy scores
+   `execute_sql`, `read`, `grep`
+   — `list` and `search` are intentionally excluded: both return only file metadata
+   (path/dir/size), not file content. They are navigation tools — the agent uses them to decide
+   what to read next, not to compose its answer. Including them dilutes Contextual Relevancy
+   with path listings that have no semantic relationship to the question.
 4. Accepts an optional `model: { provider, modelId }` body field to override the project default
 5. Protected by the same `authMiddleware` as every other route — no additional gate needed
 
@@ -196,16 +198,18 @@ independently — a failure in one does not suppress the others.
 
 **Tool result serialization** (pure Python in `conftest.py`, not backend logic):
 
-| Tool | Serialized as | Why included |
-|---|---|---|
-| `execute_sql` | `"[SQL result]\nColumns: {cols}\n{rows as markdown table}"` | The primary grounding — actual query results the agent used |
-| `read` | `"[File: {path}]\n{content}"` | Schema docs, column descriptions, semantic models the agent read |
-| `grep` | `"[grep: {pattern}]\n{matches}"` | File content snippets matched during search |
-| `search` | `"[search]\n{raw output}"` | Semantic search results over project context |
+| Tool | Returns | Included | Reason |
+|---|---|---|---|
+| `execute_sql` | data rows | yes | primary grounding — the actual query results the agent used |
+| `read` | full file content | yes | schema docs, column descriptions, semantic models |
+| `grep` | matching lines + context | yes | actual file content snippets |
+| `search` | path/dir/size | **no** | glob file finder — navigation only, no content |
+| `list` | path/dir/size | **no** | directory listing — navigation only, no content |
 
-`list` is excluded — it returns directory metadata only (path/dir/size), never file content, and
-consistently dilutes Contextual Relevancy by adding entries with no semantic relationship to the
-question.
+`search` and `list` are excluded for the same reason: both return only file metadata. They are
+navigation tools the agent uses to decide what to read next — not to compose its answer. Including
+them adds path listings with no semantic relationship to the question, diluting Contextual
+Relevancy without adding grounding signal.
 
 ---
 
@@ -298,54 +302,38 @@ Results are written to `{project}/tests/outputs/evals_{timestamp}.json`:
 
 ```json
 {
-  "timestamp": "2026-06-25T11:32:24.385414",
+  "timestamp": "2026-06-25T21:35:40.653031",
   "results": [
     {
       "id": "q001",
       "input": "How many ports are currently decommissioned?",
-      "passed": false,
+      "actual_output": "There are currently 4 decommissioned ports (i.e., ports with a non-null decommissioned_ts).",
+      "passed": true,
       "metrics": [
-        {
-          "name": "Faithfulness",
-          "score": 0.3333,
-          "threshold": 0.7,
-          "passed": false,
-          "reason": "The score is 0.33 because the actual output contradicts the retrieval context in multiple ways regarding the decommissioned ports data..."
-        },
-        {
-          "name": "Contextual Relevancy",
-          "score": 0.2667,
-          "threshold": 0.5,
-          "passed": false,
-          "reason": "The score is 0.27 because while there are some tangentially relevant elements such as a file named 'decommissioned_ports_check.yml'..."
-        },
-        {
-          "name": "Answer Relevancy",
-          "score": 1.0,
-          "threshold": 0.7,
-          "passed": true,
-          "reason": "The score is 1.00 because the response directly and completely addresses the question about the number of currently decommissioned ports. Great job!"
-        }
+        { "name": "Faithfulness",          "score": 1.0,  "threshold": 0.7, "passed": true,  "reason": "The score is 1.00 because the actual output is perfectly faithful to the retrieval context with no contradictions found!" },
+        { "name": "Contextual Relevancy",  "score": 1.0,  "threshold": 0.5, "passed": true,  "reason": "The score is 1.00 because the retrieval context is perfectly relevant, directly addressing the question with a specific SQL query counting decommissioned ports and providing the exact answer: 'There are 4 decommissioned ports.'" },
+        { "name": "Answer Relevancy",      "score": 1.0,  "threshold": 0.7, "passed": true,  "reason": "The score is 1.00 because the response directly and completely addresses the question. Great job staying on topic!" }
       ]
     },
     {
       "id": "q002",
       "input": "What is the overall uptime percentage of my EV charging network for the full history?",
-      "passed": false,
+      "actual_output": "Overall Uptime: 99.71% across Sep 15, 2025 – Jun 26, 2026. All 4,986 downtime minutes were concentrated in October 2025. Every other month recorded 100% uptime.",
+      "passed": true,
       "metrics": [
-        { "name": "Faithfulness", "score": 1.0, "threshold": 0.7, "passed": true, "reason": "..." },
-        { "name": "Contextual Relevancy", "score": 0.4737, "threshold": 0.5, "passed": false, "reason": "The score is 0.47 because while some retrieved context is highly relevant — including the 'uptime' semantic model — a significant portion covers unrelated models like 'charge_attempts', 'fact_visits', 'dim_dates'..." },
-        { "name": "Answer Relevancy", "score": 1.0, "threshold": 0.7, "passed": true, "reason": "..." }
+        { "name": "Faithfulness",          "score": 0.8889, "threshold": 0.7, "passed": true,  "reason": "The score is 0.89 because the actual output incorrectly states total commissioned minutes as 1,722,240 when the retrieval context specifies 1,716,480." },
+        { "name": "Contextual Relevancy",  "score": 0.507,  "threshold": 0.5, "passed": true,  "reason": "The score is 0.51 because while a significant portion of the retrieval context is irrelevant (empty SQL results, database metadata, schema config), the context contains the direct answer '99.71% uptime' and supporting breakdowns. The large volume of irrelevant context dilutes the score." },
+        { "name": "Answer Relevancy",      "score": 0.8182, "threshold": 0.7, "passed": true,  "reason": "The score is 0.82 because the output addresses overall uptime but includes unnecessary detail about which chargers caused the downtime events — not directly relevant to the question asked." }
       ]
     }
   ]
 }
 ```
 
-Both records fail on Contextual Relevancy — consistent with the structural dilution described above.
-Answer Relevancy is perfect (1.0) on both: the agent answers the question correctly even when it
-fetches broader context than strictly needed. Faithfulness failure on q001 indicates the agent made
-claims not grounded in what it actually retrieved — a signal worth investigating.
+Both records pass. q001 is a clean 1.0 across all metrics — the agent found the exact SQL and answer.
+q002 passes but reveals two signals worth tracking: Faithfulness at 0.89 (the agent misreported
+one number) and Contextual Relevancy just above threshold at 0.51 (broad schema context still
+diluting the score, consistent with the structural dilution described above).
 
 ---
 

From b89b4f73f6df32e0d37cf85b94a89a82db5f023b Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 22:08:27 -0400
Subject: [PATCH 25/38] Update curated context description for evaluations

Clarified the context description for evaluations by removing the word 'faithfulness' and specifying the tools included and excluded.
---
 plan.MD | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/plan.MD b/plan.MD
index f940d86..67a7eaa 100644
--- a/plan.MD
+++ b/plan.MD
@@ -9,7 +9,7 @@ nao's agent answers questions by reading file-based project context (schema docs
 2. Executes SQL queries (`execute_sql` tool)
 3. Composes a natural language response from those tool outputs
 
-The **curated context** for a faithfulness eval is the subset of tool outputs that carry actual
+The **curated context** for evals is the subset of tool outputs that carry actual
 content: `execute_sql`, `read`, `grep`, and `search`. The `list` tool is excluded — it returns
 directory metadata (path/dir/size), not file content, so it adds no semantic signal to the eval.
 

From 06597d40f4c4e4c99acd8d592c25af6ac74f5a1d Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 22:09:56 -0400
Subject: [PATCH 26/38] Update plan.MD

exclude search
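
For illustration only, the exclusion can be sketched in Python. The real filter is the `CONTEXT_TOOLS` list in the TypeScript backend; the dict shape (`tool_name`, `output`) and the helper name `curate_context` are assumptions made for this sketch, not nao's actual payload or API.

```python
# Content-bearing tools named in plan.MD; `list` and `search` are left out on purpose.
CONTEXT_TOOLS = {"execute_sql", "read", "grep"}


def curate_context(tool_results):
    """Keep only the outputs of content-bearing tools, preserving call order.

    `tool_results` is assumed to be a list of dicts with hypothetical
    `tool_name` and `output` keys.
    """
    return [
        str(result["output"])
        for result in tool_results
        if result.get("tool_name") in CONTEXT_TOOLS
    ]


sample = [
    {"tool_name": "list", "output": "models/ (dir)"},
    {"tool_name": "read", "output": "uptime: semantic model definition"},
    {"tool_name": "search", "output": "models/uptime.yml"},
    {"tool_name": "execute_sql", "output": "decommissioned_ports: 4"},
]
curated = curate_context(sample)
```

Only the `read` and `execute_sql` outputs survive; the navigation calls are dropped before anything reaches the judge.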
---
 plan.MD | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/plan.MD b/plan.MD
index 67a7eaa..0e01ffc 100644
--- a/plan.MD
+++ b/plan.MD
@@ -10,8 +10,9 @@ nao's agent answers questions by reading file-based project context (schema docs
 3. Composes a natural language response from those tool outputs
 
 The **curated context** for evals is the subset of tool outputs that carry actual
-content: `execute_sql`, `read`, `grep`, and `search`. The `list` tool is excluded — it returns
-directory metadata (path/dir/size), not file content, so it adds no semantic signal to the eval.
+content: `execute_sql`, `read`, `grep`. Both `list` and `search` are excluded — they return only
+file metadata (path/dir/size), not file content. They are navigation tools the agent uses to
+decide what to read next, not to compose its answer.
 
 ---
 

From d54f83355f777332daae4eb2f9286b6faa381094 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 22:13:50 -0400
Subject: [PATCH 27/38] Update plan.MD

maintenance burden
---
 plan.MD | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/plan.MD b/plan.MD
index 0e01ffc..134791b 100644
--- a/plan.MD
+++ b/plan.MD
@@ -66,6 +66,11 @@ dataset.
 The only requirement is a running nao backend, which is already needed to get `actual_output`
 anyway. Option B adds zero extra infrastructure.
 
+**Maintenance burden:** `CONTEXT_TOOLS` in `apps/backend/src/routes/evals.ts` must be kept in sync
+as new agent tools are added. When a new tool is introduced, the nao team needs to decide whether
+its output is content-bearing (add it) or navigation-only (leave it out). This is a small but
+ongoing maintenance cost that Option A does not have.
+
 ---
 
 ## Architecture

From b914765c48fe2c2135bde29cd8a45b4dd858dd6e Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 22:21:26 -0400
Subject: [PATCH 28/38] Update plan.MD

pivots
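
A rough sketch of the per-metric loop this change moves to, assuming a stand-in `StubMetric` class that only mimics the `measure` / `score` / `reason` / `threshold` surface of a DeepEval metric. `StubMetric`, `run_record`, and the dict-shaped test case are hypothetical names for illustration, not code from the repository.

```python
from dataclasses import dataclass


@dataclass
class StubMetric:
    """Stand-in for a DeepEval metric; returns a fixed score instead of calling a judge."""
    name: str
    threshold: float
    fixed_score: float
    score: float = 0.0
    reason: str = ""

    def measure(self, test_case):
        self.score = self.fixed_score
        self.reason = f"stub score for {test_case['id']}"
        return self.score


def run_record(test_case, metrics, session_results):
    """Measure every metric, record the outcome, then fail if any metric missed its threshold."""
    rows = []
    for metric in metrics:
        metric.measure(test_case)  # direct call, so score and reason live on this instance
        rows.append({
            "name": metric.name,
            "score": metric.score,
            "threshold": metric.threshold,
            "passed": metric.score >= metric.threshold,
            "reason": metric.reason,
        })
    passed = all(row["passed"] for row in rows)
    # Record before raising so the JSON report still contains failing records.
    session_results.append({
        "id": test_case["id"],
        "input": test_case["input"],
        "actual_output": test_case["actual_output"],
        "passed": passed,
        "metrics": rows,
    })
    if not passed:
        failed = ", ".join(row["name"] for row in rows if not row["passed"])
        raise AssertionError(f"{test_case['id']} failed: {failed}")


session_results = []
run_record(
    {"id": "q001", "input": "question", "actual_output": "answer"},
    [StubMetric("Faithfulness", 0.7, 1.0), StubMetric("Contextual Relevancy", 0.5, 1.0)],
    session_results,
)
```

Appending to `session_results` before raising is what lets the output file be written regardless of pass/fail.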
---
 plan.MD | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/plan.MD b/plan.MD
index 134791b..e6fbd03 100644
--- a/plan.MD
+++ b/plan.MD
@@ -99,10 +99,10 @@ Add one new Fastify route: `POST /api/evals/chat`.
 
 It mirrors `/api/agent` but:
 
-1. Calls `agent.generate()` instead of `agent.stream()` — blocks until the agent finishes, returns
-   `AgentRunResult`
+1. Calls `testAgentService.runTest()` (same service used by `nao test`) — non-streaming, returns
+   the full `AgentRunResult`
 2. Returns JSON `{ text, model_id, tool_results }` instead of a UI message stream
-3. `tool_results` = `steps.flatMap(s => s.toolResults)` filtered to content-bearing tools:
+3. `tool_results` = `TestAgentService.extractToolCalls(result)` filtered to content-bearing tools:
    `execute_sql`, `read`, `grep`
    — `list` and `search` are intentionally excluded: both return only file metadata
    (path/dir/size), not file content. They are navigation tools — the agent uses them to decide
@@ -192,11 +192,9 @@ for each record in golden_dataset.jsonl:
        retrieval_context = curated_context,
        expected_output   = record.expected_output   # used by completeness later
      )
-  5. assert_test(test_case, metrics=[
-       faithfulness_metric,
-       context_relevance_metric,
-       answer_relevance_metric,
-     ])
+  5. for each metric: metric.measure(test_case)   # direct call — owns the instance, reads score/reason
+     record result in session_results
+     raise AssertionError if any metric failed
 ```
 
 All three metrics run concurrently on the same test case. DeepEval reports each score and reason
@@ -230,7 +228,7 @@ agent. `conftest.py` instantiates the judge from that model ID:
 
 ```python
 FaithfulnessMetric(threshold=0.7, model=judge, include_reason=True)
-ContextualRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
+ContextualRelevancyMetric(threshold=0.5, model=judge, include_reason=True)
 AnswerRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
 ```
 
@@ -273,6 +271,10 @@ response `model_id`. No code changes needed.
 - **`llm_test_case`** — serializes tool results into `retrieval_context`, returns `LLMTestCase`.
 - **`faithfulness_metric`**, **`context_relevance_metric`**, **`answer_relevance_metric`** — metric
   instances using the judge model.
+- **`session_results`** — session-scoped list collecting `{id, input, actual_output, passed, metrics}`
+  for each record. Populated in `test_evals_triad.py` after each `metric.measure()` call.
+- **`pytest_sessionfinish`** — writes `session_results` to `NAO_EVALS_OUTPUT_FILE` (set by the runner
+  to `{project}/tests/outputs/evals_{timestamp}.json`) at the end of the session.
 - **`completeness_metric`** — add later; `GEval` with custom rubric (see below).
 
 ---
@@ -365,8 +367,12 @@ This gives a weekly regression signal without gating every PR.
 2. **Register route** in `apps/backend/src/app.ts` under `/api/evals`
 3. **Write `cli/nao_core/evals/conftest.py`** — fixtures, judge model factory, tool result
    serializer, dataset loader via `NAO_EVALS_DIR`
-4. **Write `cli/nao_core/evals/test_evals_triad.py`** — single parametrized test,
-   `assert_test(test_case, metrics=[...])`
+4. **Write `cli/nao_core/evals/test_evals_triad.py`** — single parametrized test; calls
+   `metric.measure(test_case)` per metric directly (not `assert_test`) so scores and reasons are
+   readable on the same instances; appends to `session_results`; raises `AssertionError` on failure
+   so that `nao evals` exits non-zero and pytest reports `FAILED` when thresholds aren't met —
+   consistent with `nao test`, which calls `sys.exit(1)` on any test failure. The JSON output
+   is always written regardless of pass/fail.
 5. **Write `cli/nao_core/commands/evals/runner.py`** — `nao evals` command mirroring `nao test`;
    `-m`, `-u`, `--password`, `-v`, `-s` flags; authenticates via `get_auth_session`, sets env vars,
    invokes pytest

From b73ed8f426aedcc41ce8ace4c6c75d28c9a808ba Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 22:36:15 -0400
Subject: [PATCH 29/38] Update plan.MD

less drama
---
 plan.MD | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/plan.MD b/plan.MD
index e6fbd03..f4c2113 100644
--- a/plan.MD
+++ b/plan.MD
@@ -349,14 +349,6 @@ diluting the score, consistent with the structural dilution described above).
 
 Do not block PRs on evals — they are slow (~5–10 s per LLM judge call) and cost money.
 
-Add `.github/workflows/evals.yml`:
-- Trigger: `workflow_dispatch` (manual) + `schedule: cron: '0 6 * * 1'` (weekly, Monday 06:00 UTC)
-- Spins up nao with a test project pointed at a fixture database
-- Runs the eval suite
-- Publishes scores as a job summary table (record ID, score, reason for any failures)
-
-This gives a weekly regression signal without gating every PR.
-
 ---
 
 ## Implementation Sequence

From 18355d1a27c484e9a3c8fadfe42e7011f8f24e6b Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Thu, 25 Jun 2026 22:38:17 -0400
Subject: [PATCH 30/38] Update plan.MD

even less drama
---
 plan.MD | 1 -
 1 file changed, 1 deletion(-)

diff --git a/plan.MD b/plan.MD
index f4c2113..f217fe1 100644
--- a/plan.MD
+++ b/plan.MD
@@ -371,7 +371,6 @@ Do not block PRs on evals — they are slow (~5–10 s per LLM judge call) and c
 6. **Register command** in `cli/nao_core/commands/__init__.py` and `cli/nao_core/main.py`
 7. **Add `evals` extra** to `cli/pyproject.toml` — `deepeval>=2.0` only
 8. **Populate `{project}/tests/evals/golden_dataset.jsonl`** — Q&A pairs from golden dataset
-9. **Wire CI** — add `.github/workflows/evals.yml`
 
 ---
 

From 55ccc0e1f34f17a2013d57004d2e416b64056890 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 11:53:33 -0400
Subject: [PATCH 31/38] Update thoughts.md

Metric strategy: referenceless vs reference-based
---
 demo/thoughts.md | 72 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 05d1588..45ebdc8 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -25,17 +25,58 @@ We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this pro
 
 ---
 
-## Framework options
+### Metric strategy: referenceless vs reference-based
+
+Two approaches, not mutually exclusive.
+
+**Option A — Referenceless: RAG triad (Faithfulness, Contextual Relevancy, Answer Relevancy)**
+
+The RAG triad is a proxy framework. It assumes correctness flows downstream from three conditions:
+(1) the context retrieved was relevant to the question, (2) the answer was grounded in that context,
+and (3) the answer addressed what was asked. If all three hold, a correct answer is likely — but
+it is never directly verified. The triad deliberately dodges the hard problem: does the actual output
+match ground truth? The reason it does so is that ground truth is assumed to be expensive and
+ambiguous to define.
+
+For nao this means the triad catches hallucinations and off-topic answers but will not catch a
+response that is faithful, relevant, and still factually wrong — for example, an answer grounded in
+context that itself contains a stale or incorrect value.
+
+**Option B — Reference-based: Correctness (GEval)**
+
+Compares the actual output directly against the expected output. Requires a `expected_output` for
+every golden record. Catches factual errors the triad misses — if the number is wrong, the score
+drops regardless of how grounded the answer is.
+
+```python
+from deepeval.metrics import GEval
+from deepeval.test_case import LLMTestCaseParams
+
+correctness_metric = GEval(
+    name="Correctness",
+    evaluation_steps=[
+        "Compare the actual output directly with the expected output to verify factual accuracy.",
+        "Check if all elements mentioned in the expected output are present and correctly represented in the actual output.",
+        "Assess if there are any discrepancies in details, values, or information between the actual and expected outputs.",
+    ],
+    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
+)
+```
 
-| | DeepEval | Latitude | In-house |
-|--|----------|----------|----------|
-| **Approach** | Pre-built generic metric library | Evals derived from production failures and expert judgment | Hand-rolled judge prompts |
-| **Best for** | Pre-production unit testing — useful when there is no production traffic yet | Post-production — requires real failure data to work from | Full control, no dependencies |
-| **Main value** | No prompt engineering required; plug in metrics and run | Evals grounded in actual user failures, not generic rubrics | Fully tailored to the use case |
-| **Main risk** | Score quality depends on golden dataset quality — the framework is only as good as the reference answers | Not useful pre-production; concedes this itself | Requires prompt engineering expertise; hard to maintain |
-| **Multi-turn path** | Built in | Built in | Rebuild from scratch |
+The cost: `expected_output` must be maintained. For nao's SQL-answerable questions this is low
+— the expected answer can be generated by running the SQL directly. For open-ended or exploratory
+questions it is harder to define and more likely to drift as the data changes.
 
-**Why this matters for nao:** we are targeting reference-based evals and assume end users arrive with a golden dataset. This changes the comparison significantly. The "generic metrics lie" concern — the main argument against DeepEval — is largely neutralised when every entry has a `reference_answer`: the judge is not scoring in the abstract, it is comparing against a concrete expected output. Latitude's value proposition (evals grounded in real production failures) does not apply here; a golden dataset replaces the need for production traffic. In-house also becomes more viable since comparing against a reference answer is a simpler judge prompt than scoring on abstract rubrics — but DeepEval still wins on setup cost and the multi-turn path for the noa labs team.
+**Recommended: both together**
+
+The triad and Correctness are complementary. The triad diagnoses *why* something failed (wrong
+context, hallucination, off-topic). Correctness catches *whether* it failed. Running both gives a
+richer signal: a response can pass Faithfulness but fail Correctness (grounded in context but the
+context was wrong), or pass Correctness but fail Faithfulness (lucky correct answer not actually
+supported by what the agent saw).
+
+The `expected_output` field is already in every golden record. Adding Correctness requires one
+additional fixture in `conftest.py` and one line in the `metrics=[]` list — no schema changes.
 
 ---
 
@@ -52,10 +93,21 @@ RAG traid tests relationship b/w three entities: Question, Context and Response
 | Answer relevance | input and actual_output | Does the response address what the user actually asked? |
 | Completeness | input and actual_output | Did the response cover the full scope of the question at the right level of detail? |
 
-So the challenge will be to attache curated context to the test at runtime. Other than that, the triad itself seems very close to what nao users might want to accomplish with their evals.
+---
+
+## Framework options
 
+| | DeepEval | Latitude | In-house |
+|--|----------|----------|----------|
+| **Approach** | Pre-built generic metric library | Evals derived from production failures and expert judgment | Hand-rolled judge prompts |
+| **Best for** | Pre-production unit testing — useful when there is no production traffic yet | Post-production — requires real failure data to work from | Full control, no dependencies |
+| **Main value** | No prompt engineering required; plug in metrics and run | Evals grounded in actual user failures, not generic rubrics | Fully tailored to the use case |
+| **Main risk** | Score quality depends on golden dataset quality — the framework is only as good as the reference answers | Not useful pre-production; concedes this itself | Requires prompt engineering expertise; hard to maintain |
+| **Multi-turn path** | Built in | Built in | Rebuild from scratch |
 
+**Why this matters for nao:** we are targeting reference-based evals and assume end users arrive with a golden dataset. This changes the comparison significantly. The "generic metrics lie" concern — the main argument against DeepEval — is largely neutralised when every entry has a `reference_answer`: the judge is not scoring in the abstract, it is comparing against a concrete expected output. Latitude's value proposition (evals grounded in real production failures) does not apply here; a golden dataset replaces the need for production traffic. In-house also becomes more viable since comparing against a reference answer is a simpler judge prompt than scoring on abstract rubrics — but DeepEval still wins on setup cost and the multi-turn path for the nao labs team.
 
+---
 
 ## Do not use the rest
 ---

From 90bb57f7d9fcb58d0e8cf1621b04eecebe9f4451 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 11:55:33 -0400
Subject: [PATCH 32/38] Update thoughts.md

reference-based as a choice
---
 demo/thoughts.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 45ebdc8..51459dc 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -13,8 +13,6 @@ We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this pro
 
 **Single-turn** — each eval entry is a single, atomic unit of interaction with the LLM app: one user input, one assistant response, no conversation history. The assistant is evaluated on what it says in that one reply, in isolation.
 
-**Reference-based** — every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in whether the agent arrived at the expected answer rather than asking it to reason from the rubric alone.
-
 **End-to-end** — we evaluate the observable input and output of the Chat BI system and treat it as a black box. We do not instrument internal steps — no retrieval spans, no tool call traces, no sub-agent scoring. We care about the result the user sees, not the path the system took to produce it. This is the right fit for context-change evals: if the answer improved, the context change worked, regardless of what happened inside.
 
 **Local-first** — the eval harness runs entirely on the developer's machine, with no external eval platform or cloud service required. The golden dataset, judge prompts, scores, and results all live in this repository. This keeps the feedback loop fast, keeps data private, and means the eval is as easy to run as any other dbt command.
@@ -44,6 +42,8 @@ context that itself contains a stale or incorrect value.
 
 **Option B — Reference-based: Correctness (GEval)**
 
+Every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in whether the agent arrived at the expected answer rather than asking it to reason from the rubric alone.
+
 Compares the actual output directly against the expected output. Requires a `expected_output` for
 every golden record. Catches factual errors the triad misses — if the number is wrong, the score
 drops regardless of how grounded the answer is.

From 277702d7fbcaef4b635a41287764045d5fd953a3 Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 12:00:03 -0400
Subject: [PATCH 33/38] Update thoughts.md

curated context explained
---
 demo/thoughts.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 51459dc..c972c7d 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -93,6 +93,10 @@ RAG traid tests relationship b/w three entities: Question, Context and Response
 | Answer relevance | input and actual_output | Does the response address what the user actually asked? |
 | Completeness | input and actual_output | Did the response cover the full scope of the question at the right level of detail? |
 
+**How curated context is captured at runtime**
+
+nao does not have a retrieval step — its context is the set of tool calls the agent made while composing its answer. Tool calls that return actual content (e.g. `execute_sql`, `read`, `grep`) become the curated context passed to the judge. Tool calls that return only file metadata (e.g. `list`, `search`) are excluded — they are navigation, not grounding.
+
 ---
 
 ## Framework options

From de1fc04ca592e4b8282ae83ebc91dbfe5da0a40e Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 12:23:09 -0400
Subject: [PATCH 34/38] Update thoughts.md

RAG example
---
 demo/thoughts.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index c972c7d..612e40e 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -40,6 +40,14 @@ For nao this means the triad catches hallucinations and off-topic answers but wi
 response that is faithful, relevant, and still factually wrong — for example, an answer grounded in
 context that itself contains a stale or incorrect value.
 
+```python
+from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric, AnswerRelevancyMetric
+
+faithfulness_metric        = FaithfulnessMetric(threshold=0.7, model=judge, include_reason=True)
+context_relevancy_metric   = ContextualRelevancyMetric(threshold=0.5, model=judge, include_reason=True)
+answer_relevancy_metric    = AnswerRelevancyMetric(threshold=0.7, model=judge, include_reason=True)
+```
+
 **Option B — Reference-based: Correctness (GEval)**
 
 Every entry includes a `reference_answer` that describes what a correct response looks like. The judge uses this as the gold standard — not to demand an exact match, but to assess whether the actual response aligns with the same intent, facts, and format. This grounds the judge's scoring in whether the agent arrived at the expected answer rather than asking it to reason from the rubric alone.

From 02a818f89760efff5983411b896b340a84b2752b Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 12:33:05 -0400
Subject: [PATCH 35/38] Update thoughts.md

clarify the difference in constructing
---
 demo/thoughts.md | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 612e40e..4e0c946 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -40,8 +40,20 @@ For nao this means the triad catches hallucinations and off-topic answers but wi
 response that is faithful, relevant, and still factually wrong — for example, an answer grounded in
 context that itself contains a stale or incorrect value.
 
+These are purpose-built metrics — DeepEval already knows which `LLMTestCase` fields each one needs:
+- `FaithfulnessMetric` → `input`, `actual_output`, `retrieval_context`
+- `ContextualRelevancyMetric` → `input`, `retrieval_context`
+- `AnswerRelevancyMetric` → `input`, `actual_output`
+
 ```python
 from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric, AnswerRelevancyMetric
+from deepeval.test_case import LLMTestCase
+
+test_case = LLMTestCase(
+    input="How many ports are currently decommissioned?",
+    actual_output="There are 4 decommissioned ports.",
+    retrieval_context=["[SQL result]\ndecommissioned_ports: 4"],
+)
 
 faithfulness_metric        = FaithfulnessMetric(threshold=0.7, model=judge, include_reason=True)
 context_relevancy_metric   = ContextualRelevancyMetric(threshold=0.5, model=judge, include_reason=True)
@@ -56,9 +68,17 @@ Compares the actual output directly against the expected output. Requires a `exp
 every golden record. Catches factual errors the triad misses — if the number is wrong, the score
 drops regardless of how grounded the answer is.
 
+`GEval` is a blank-slate metric — you write the evaluation steps yourself, so you must explicitly declare which `LLMTestCase` fields to pass via `evaluation_params`.
+
 ```python
 from deepeval.metrics import GEval
-from deepeval.test_case import LLMTestCaseParams
+from deepeval.test_case import LLMTestCase, LLMTestCaseParams
+
+test_case = LLMTestCase(
+    input="How many ports are currently decommissioned?",
+    actual_output="There are 4 decommissioned ports.",
+    expected_output="There are 4 decommissioned ports.",
+)
 
 correctness_metric = GEval(
     name="Correctness",

From 8195d26b7628bddbcd623617bcfacc016e31581c Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 12:47:14 -0400
Subject: [PATCH 36/38] Update plan.MD

match expected behavior from https://github.com/getnao/nao/issues/651

  A tests/ or agent/tests/ directory convention where data teams define test cases: input query + expected behavior (correct table referenced, correct metric, no hallucination, etc.).
  nao test runs the suite, reports pass/fail per case, and outputs a summary score.
  CI-friendly output (exit code, JSON report).
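
A minimal sketch of the CI-friendly part (summary block plus exit code), under stated assumptions: the top-level `results` key and the helper names `build_report` and `exit_code_for` are hypothetical and chosen for illustration. Only the `summary: { total, passed, failed }` shape comes from plan.MD.

```python
import json


def build_report(session_results):
    """Attach a summary block to the per-record results."""
    total = len(session_results)
    passed = sum(1 for record in session_results if record["passed"])
    return {
        "results": session_results,  # assumed key name
        "summary": {"total": total, "passed": passed, "failed": total - passed},
    }


def exit_code_for(report):
    """Return 1 when any record failed, else 0, so CI can gate on the process status."""
    return 1 if report["summary"]["failed"] > 0 else 0


report = build_report([
    {"id": "q001", "passed": True},
    {"id": "q002", "passed": True},
])
serialized = json.dumps(report, indent=2)
code = exit_code_for(report)
```

A runner would pass the returned code to `sys.exit` after pytest finishes; keeping the decision in a pure function makes it easy to test without spawning a process.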
---
 plan.MD | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/plan.MD b/plan.MD
index f217fe1..9d3bc08 100644
--- a/plan.MD
+++ b/plan.MD
@@ -273,8 +273,9 @@ response `model_id`. No code changes needed.
   instances using the judge model.
 - **`session_results`** — session-scoped list collecting `{id, input, actual_output, passed, metrics}`
   for each record. Populated in `test_evals_triad.py` after each `metric.measure()` call.
-- **`pytest_sessionfinish`** — writes `session_results` to `NAO_EVALS_OUTPUT_FILE` (set by the runner
-  to `{project}/tests/outputs/evals_{timestamp}.json`) at the end of the session.
+- **`pytest_sessionfinish`** — writes `session_results` plus a `summary: { total, passed, failed }`
+  to `NAO_EVALS_OUTPUT_FILE` (set by the runner to `{project}/tests/outputs/evals_{timestamp}.json`)
+  at the end of the session.
 - **`completeness_metric`** — add later; `GEval` with custom rubric (see below).
 
 ---
@@ -334,7 +335,12 @@ Results are written to `{project}/tests/outputs/evals_{timestamp}.json`:
         { "name": "Answer Relevancy",      "score": 0.8182, "threshold": 0.7, "passed": true,  "reason": "The score is 0.82 because the output addresses overall uptime but includes unnecessary detail about which chargers caused the downtime events — not directly relevant to the question asked." }
       ]
     }
-  ]
+  ],
+  "summary": {
+    "total": 2,
+    "passed": 2,
+    "failed": 0
+  }
 }
 ```
 
@@ -367,7 +373,8 @@ Do not block PRs on evals — they are slow (~5–10 s per LLM judge call) and c
    is always written regardless of pass/fail.
 5. **Write `cli/nao_core/commands/evals/runner.py`** — `nao evals` command mirroring `nao test`;
    `-m`, `-u`, `--password`, `-v`, `-s` flags; authenticates via `get_auth_session`, sets env vars,
-   invokes pytest
+   invokes pytest; after pytest exits, reads `summary` from the JSON output and calls `sys.exit(1)`
+   if `failed > 0` — same pattern as `nao test`
 6. **Register command** in `cli/nao_core/commands/__init__.py` and `cli/nao_core/main.py`
 7. **Add `evals` extra** to `cli/pyproject.toml` — `deepeval>=2.0` only
 8. **Populate `{project}/tests/evals/golden_dataset.jsonl`** — Q&A pairs from golden dataset

From 8d29afaa7dc026d7de2b38f813e77fe87ebbe05c Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 12:52:21 -0400
Subject: [PATCH 37/38] Update thoughts.md

open questions and acceptance criteria
---
 demo/thoughts.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 4e0c946..823982a 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -5,6 +5,21 @@
 **Goal** — implement an evals framework that quantifies the impact of context changes on the nao Chat BI tool. We are not testing the LLM or general chat performance. We are testing one specific thing: did a change to the context — RULES.md, semantic model definitions, or similar input files — make the assistant's answers better or worse? The eval score is a signal for context quality, not model quality. Nao already has SQL tests in place that guard against schema linking failures and semantic gaps. We are looking to add non-deterministic evals that catch failures when the SQL and even the number is correct.
 
 
+## Open questions
+
+- [ ] Referenceless (RAG triad) or reference-based (Correctness), or both?
+- [ ] Use DeepEval's built-in metrics or maintain custom judge prompts?
+
+## Acceptance criteria
+
+- [ ] Data teams define test cases in `tests/evals/` alongside existing `tests/*.yml` SQL tests
+- [ ] `nao evals` runs from a project directory and produces a JSON report in `tests/outputs/`
+- [ ] Report includes pass/fail per test case, per-metric scores and reasons, and a summary
+- [ ] Exit code is non-zero when any case fails — `nao evals` is CI-friendly
+- [ ] Golden dataset lives in `tests/evals/golden_dataset.jsonl` — no boilerplate beyond `id`, `input`, `expected_output`
+
+---
+
 ## Design choices
 
 We are adding **LLM-as-a-judge, single-turn, reference-based evals** to this project. Here is what each term means:

From 9df91ee39b4d109b0bbd654b615e41deea43c3aa Mon Sep 17 00:00:00 2001
From: Daria <daria@kwwhat.com>
Date: Fri, 26 Jun 2026 12:53:55 -0400
Subject: [PATCH 38/38] Update thoughts.md

tools
---
 demo/thoughts.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/demo/thoughts.md b/demo/thoughts.md
index 823982a..ac5d0e7 100644
--- a/demo/thoughts.md
+++ b/demo/thoughts.md
@@ -9,6 +9,7 @@
 
 - [ ] Referenceless (RAG triad) or reference-based (Correctness), or both?
 - [ ] Use DeepEval's built-in metrics or maintain custom judge prompts?
+- [ ] RAG triad requires nao to maintain a list of content-bearing tools (`execute_sql`, `read`, `grep`) — acceptable ongoing cost? If a new tool is added (e.g. `fetch_api`, `query_vector_store`), the team must decide whether its output is grounding content or navigation metadata and update the list.
 
 ## Acceptance criteria