docs(demo): baseline for Stage 2 #90

Draft
VoxelPrincess wants to merge 39 commits into main from chat-evals-stabilization

@VoxelPrincess

No description provided.

VoxelPrincess self-assigned this May 13, 2026
Comment thread demo/thoughts.md

### Notes

- `primary_context` is a reviewer traceability pointer — not injected into the judge prompt.

I think this may be all we need: https://deepeval.com/docs/metrics-answer-relevancy

Faithfulness also seems off (https://deepeval.com/docs/metrics-faithfulness): it evaluates the answer against provided context, whereas in our case the context is provided to the chat, not to the test.

Unless we provide reference context, which might be a lot.

@daria-sukhareva daria-sukhareva Jun 12, 2026

`AnswerRelevancyMetric` (only needs `input` + `actual_output`)
`FaithfulnessMetric` (needs `actual_output` + `retrieval_context`, which your RAG pipeline already produces). We do have context, but it is curated rather than retrieved!
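
To make the input requirements concrete, here is a minimal sketch that checks whether a test case carries the fields each metric needs. The field lists are copied from the comment above and may not match every deepeval version; `EvalCase`, `REQUIRED_FIELDS`, and `missing_fields` are hypothetical names for illustration, not deepeval APIs, and the curated context is assumed to travel in the `retrieval_context` slot.

```python
from dataclasses import dataclass, field

# Field requirements as stated in the comment above (assumption: they match
# the deepeval version in use; check the metric docs before relying on them).
REQUIRED_FIELDS = {
    "AnswerRelevancyMetric": ("input", "actual_output"),
    "FaithfulnessMetric": ("actual_output", "retrieval_context"),
}


@dataclass
class EvalCase:
    """Hypothetical container mirroring the fields discussed in this thread."""
    input: str = ""
    actual_output: str = ""
    # Curated (not retrieved) context is assumed to be passed in this slot.
    retrieval_context: list[str] = field(default_factory=list)


def missing_fields(case: EvalCase, metric_name: str) -> list[str]:
    """Return the required fields that are empty for the given metric."""
    return [name for name in REQUIRED_FIELDS[metric_name] if not getattr(case, name)]


case = EvalCase(input="What was Q1 revenue?", actual_output="Q1 revenue was 3M.")
print(missing_fields(case, "AnswerRelevancyMetric"))  # []
print(missing_fields(case, "FaithfulnessMetric"))     # ['retrieval_context']
```

A case without attached context can still be scored for relevancy, but it would have to be skipped or failed for faithfulness.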

https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py Judging by the prompt, we can end up with a perfectly relevant answer that is completely wrong with respect to the context: a hallucinated but relevant answer.
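
A toy illustration of that failure mode, assuming nothing about deepeval's actual LLM-judged scoring: `toy_relevancy` and `toy_faithfulness` are hypothetical stand-ins based on token overlap, used only to show that an answer can be on-topic while contradicting the context.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_relevancy(question: str, answer: str) -> float:
    """Share of question tokens that reappear in the answer (toy stand-in)."""
    q = tokens(question)
    return len(q & tokens(answer)) / len(q) if q else 0.0


def toy_faithfulness(answer: str, context: str) -> float:
    """Share of numbers in the answer that also occur in the context (toy stand-in)."""
    claimed = set(re.findall(r"\d+", answer))
    if not claimed:
        return 1.0  # nothing numeric to contradict
    supported = claimed & set(re.findall(r"\d+", context))
    return len(supported) / len(claimed)


question = "What was revenue last quarter?"
context = "Revenue last quarter was 3 million."
hallucinated = "Revenue last quarter was 9 million."

print(toy_relevancy(question, hallucinated))    # 0.8
print(toy_faithfulness(hallucinated, context))  # 0.0
```

The hallucinated answer scores high on the relevancy stand-in and zero on the faithfulness stand-in, which is the gap a relevancy-only setup would miss.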

We need faithfulness too. Faithfulness takes context as an input, and that context must be identical to what was used to generate `actual_output`, so nao will have to attach it the same way it does for a regular conversation.
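
One way to keep the two in sync, sketched below, is to build the metric's context from the same tool outputs the chat consumed while producing the answer, instead of re-deriving it in the test. `ToolCall`, `collect_context`, and the tool names are hypothetical and do not reflect nao's internals; which tools count as context-fetching is a project decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation made while answering."""
    tool: str
    output: str


def collect_context(calls: list[ToolCall], context_tools: set[str]) -> list[str]:
    """Keep, in call order, the outputs of tools that fed context to the answer."""
    return [c.output for c in calls if c.tool in context_tools]


# Tool names here are placeholders for illustration only.
calls = [
    ToolCall("list", "models/ seeds/ tests/"),
    ToolCall("read", "orders.sql: select id, amount from raw.orders"),
    ToolCall("execute_sql", "count = 42"),
]
context = collect_context(calls, context_tools={"read"})
print(context)  # ['orders.sql: select id, amount from raw.orders']
```

Passing an explicit allow-list keeps the choice of context-bearing tools visible in the test configuration rather than buried in the collector.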

daria-sukhareva and others added 15 commits June 12, 2026 11:43
more thoughts
compare eval framework options
metrics shortlist
Expanded the goal description for the evals framework to include details about existing SQL tests and the addition of non-deterministic evals.
Added section on design choices for evals framework.
Updated metrics section to include RAG triad methodology and clarified evaluation criteria for curated context.
Updated the inputs for the Faithfulness metric to include 'input'.
what we learned from running evals: exclude `list` from context-fetching tools, lower the context relevancy threshold to account for broader curated context
added example output
Comment thread plan.MD

Results are written to `{project}/tests/outputs/evals_{timestamp}.json`:

```json
…
```

let me know what you think @VoxelPrincess
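
For reference, a minimal sketch of writing results to the `{project}/tests/outputs/evals_{timestamp}.json` path quoted from plan.MD. The timestamp format and the shape of each result entry are assumptions, since the excerpt does not show them; `write_results` is a hypothetical helper name.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_results(project: Path, results: list[dict]) -> Path:
    """Write results to {project}/tests/outputs/evals_{timestamp}.json."""
    # Assumption: a UTC timestamp in this format; the plan does not specify one.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = project / "tests" / "outputs"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"evals_{stamp}.json"
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path


# Usage against a throwaway directory; the result entry is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_results(Path(tmp), [{"case": "example", "passed": True}])
    print(path.relative_to(tmp).parts[:2])  # ('tests', 'outputs')
```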

removed search + fixed tests
Clarified the context description for evaluations by removing the word 'faithfulness' and specifying the tools included and excluded.
exclude search
maintenance burden
less drama
even less drama
Metric strategy: referenceless vs reference-based
reference-based as a choice
curated context explained
RAG example
clarify the difference in constructing
match expected behavior from getnao/nao#651

  A tests/ or agent/tests/ directory convention where data teams define test cases: input query + expected behavior (correct table referenced, correct metric, no hallucination, etc.).
  nao test runs the suite, reports pass/fail per case, and outputs a summary score.
  CI-friendly output (exit code, JSON report).
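
A minimal sketch of the pass/fail summary and CI exit code described above. This is not nao's implementation; `summarize`, `exit_code`, and the result fields are hypothetical, and treating an empty suite as a success is an assumption.

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-case pass/fail into a summary with a score in [0, 1]."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "score": passed / total if total else 0.0,
    }


def exit_code(summary: dict) -> int:
    """0 when no case failed, 1 otherwise, so CI can gate on it."""
    return 0 if summary["failed"] == 0 else 1


# Case names are placeholders echoing the expected behaviors listed above.
results = [
    {"case": "correct_table", "passed": True},
    {"case": "no_hallucination", "passed": False},
]
summary = summarize(results)
print(summary)             # {'total': 2, 'passed': 1, 'failed': 1, 'score': 0.5}
print(exit_code(summary))  # 1
```
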
open questions and acceptance criteria

2 participants