docs(demo): baseline for Stage 2 by VoxelPrincess · Pull Request #90 · appspace/kwwhat

VoxelPrincess · 2026-05-13T16:47:45Z

No description provided.

daria-sukhareva · 2026-06-11T15:00:18Z

+
+### Notes
+
+- `primary_context` is a reviewer traceability pointer — not injected into the judge prompt.


I think this may be all we need: https://deepeval.com/docs/metrics-answer-relevancy

Faithfulness also seems off (https://deepeval.com/docs/metrics-faithfulness): it evaluates the answer against the provided context, whereas in our case the context is provided to the chat, not to the test,

unless we provide a reference context, which might be a lot.

AnswerRelevancyMetric (only needs input + actual_output)
FaithfulnessMetric (needs actual_output + retrieval_context, which your RAG pipeline already produces): we have context, but it is curated, not retrieved!

Judging by the prompt in https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py, we can end up with a perfectly relevant answer that is completely wrong with respect to the context: a hallucinated but relevant answer.

So we need faithfulness too. Faithfulness takes context as an input, and that context should be identical to what was used to generate actual_output, so it is something nao will have to attach the same way as for a regular conversation.
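
To make the input requirements above concrete, here is a minimal sketch in plain Python. It does not import deepeval: `EvalCase`, `REQUIRED_FIELDS`, and `runnable_metrics` are hypothetical names used only for illustration, and the field lists simply mirror the two metric lines quoted above. All it shows is that a case with no context attached can be scored for relevancy but not for faithfulness.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    """Hypothetical stand-in for an eval test case (not deepeval's own class)."""
    input: str
    actual_output: str
    # Context the answer was generated from; None when nothing was attached.
    retrieval_context: Optional[list[str]] = None


# Inputs each metric needs, as listed in this thread (an assumption, not an API).
REQUIRED_FIELDS = {
    "answer_relevancy": ("input", "actual_output"),
    "faithfulness": ("actual_output", "retrieval_context"),
}


def runnable_metrics(case: EvalCase) -> list[str]:
    """Return the metrics whose required fields are all populated on `case`."""
    ready = []
    for metric, fields in REQUIRED_FIELDS.items():
        if all(getattr(case, name) for name in fields):
            ready.append(metric)
    return ready
```

With this shape, a case built from the question and answer alone is eligible for relevancy only; faithfulness becomes available once the same context the chat saw is attached to the case.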

daria-sukhareva · 2026-06-25T16:17:41Z

+
+Results are written to `{project}/tests/outputs/evals_{timestamp}.json`:
+
+```json


let me know what you think @VoxelPrincess
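
For reference, a small sketch of how a path following the `{project}/tests/outputs/evals_{timestamp}.json` pattern could be built. The timestamp format is an assumption (the diff does not show it), and `results_path` is a hypothetical helper, not something taken from the PR.

```python
from datetime import datetime, timezone
from pathlib import Path


def results_path(project: Path, now: datetime) -> Path:
    """Build {project}/tests/outputs/evals_{timestamp}.json for a given time."""
    # Assumed timestamp format: compact UTC, e.g. 20260625T161741Z.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return project / "tests" / "outputs" / f"evals_{stamp}.json"
```

Passing the time in explicitly, rather than calling `datetime.now()` inside the helper, keeps the file name reproducible in tests.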

Match the expected behavior from getnao/nao#651: a `tests/` or `agent/tests/` directory convention where data teams define test cases as an input query plus expected behavior (correct table referenced, correct metric, no hallucination, etc.). `nao test` runs the suite, reports pass/fail per case, and outputs a summary score, with CI-friendly output (exit code, JSON report).
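
A rough sketch of the reporting side described above, assuming nothing about how `nao test` is actually implemented. `CaseResult`, `summarize`, `exit_code`, and `to_json` are hypothetical names, and the summary score is assumed to be the fraction of passing cases.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical shape)."""
    name: str
    passed: bool
    reason: str = ""


def summarize(results: list) -> dict:
    """Aggregate per-case results into a JSON-serializable summary."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        # Assumed definition of the summary score: share of passing cases.
        "score": passed / total if total else 0.0,
        "cases": [asdict(r) for r in results],
    }


def exit_code(summary: dict) -> int:
    """Return 0 when every case passed, 1 otherwise, so CI can gate on it."""
    return 0 if summary["failed"] == 0 else 1


def to_json(summary: dict) -> str:
    """Render the summary as the JSON report."""
    return json.dumps(summary, indent=2)
```

A CLI wrapper would print or write `to_json(summary)` and then call `sys.exit(exit_code(summary))`, which is what makes the run usable as a CI gate.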

docs(demo): baseline for Stage 2

b839e7f

VoxelPrincess requested a review from daria-sukhareva May 13, 2026 16:47

VoxelPrincess self-assigned this May 13, 2026

VoxelPrincess and others added 8 commits May 13, 2026 19:49

docs(demo): updates for Stage 2

d1739a1

entry format and prompts

faf58dd

just thought nothing is finalized more like food for thoughts

Update thoughts.md

49a95d9

criteria

294887c

criteria

why it matters

d3cdb85

where sql might fit in

944fcf5

completeness wording

97f78c2

tradeoffs to consider

03b0d7c

daria-sukhareva reviewed Jun 11, 2026

daria-sukhareva and others added 15 commits June 12, 2026 11:43

Update thoughts.md

131609a

more thoughts

Update thoughts.md

2438320

goal

Update thoughts.md

8d12a27

wording

Update thoughts.md

eaab631

compare eval framework options

Update thoughts.md

9cf5139

end-to-end

Update thoughts.md

56bfc1b

metrics shortlist

Enhance goal description for evals framework

87d4724

Expanded the goal description for the evals framework to include details about existing SQL tests and the addition of non-deterministic evals.

Document design choices for evals framework

df3a1d0

Added section on design choices for evals framework.

Revise metrics and evaluation framework in thoughts.md

e8d6be6

Updated metrics section to include RAG triad methodology and clarified evaluation criteria for curated context.

Modify inputs for Faithfulness metric in thoughts.md

680b780

Updated the inputs for the Faithfulness metric to include 'input'.

Document demo eval test constraints

6bee620

plan.MD

8cd3093

Merge branch 'chat-evals-stabilization' of https://github.com/appspace/kwwhat into chat-evals-stabilization

45b6c67

Update plan.MD

d26cd75

what we learned from running evals: exclude `list` from a context fetching tools, lower context relevancy threshold to account for broader curated context

Update plan.MD

ad39fcc

added example output

daria-sukhareva reviewed Jun 25, 2026

daria-sukhareva added 2 commits June 25, 2026 21:38

Update plan.MD

58ee3a3

removed search + fixed tests

Update curated context description for evaluations

b89b4f7

Clarified the context description for evaluations by removing the word 'faithfulness' and specifying the tools included and excluded.

daria-sukhareva added 13 commits June 25, 2026 22:09

Update plan.MD

06597d4

exclude search

Update plan.MD

d54f833

maintenance burden

Update plan.MD

b914765

pivots

Update plan.MD

b73ed8f

less drama

Update plan.MD

18355d1

even less drama

Update thoughts.md

55ccc0e

Metric strategy: referenceless vs reference-based

Update thoughts.md

90bb57f

reference-based as a choice

Update thoughts.md

277702d

cuarated context explained

Update thoughts.md

de1fc04

RAG example

Update thoughts.md

02a818f

clarify the difference in constructing

Update thoughts.md

8d29afa

open questions and acceptance criteria

Update thoughts.md

9df91ee

tools

VoxelPrincess wants to merge 39 commits into main from chat-evals-stabilization
