docs(demo): baseline for Stage 2 #90
Just thoughts; nothing is finalized, more like food for thought.
| ### Notes | ||
| - `primary_context` is a reviewer traceability pointer — not injected into the judge prompt. |
I think this may be all we need: https://deepeval.com/docs/metrics-answer-relevancy
Faithfulness also seems off (https://deepeval.com/docs/metrics-faithfulness): it evaluates the answer against the provided context, whereas in our case the context is provided to the chat, not to the test
unless we provide reference context, which might be a lot.
AnswerRelevancyMetric (only needs input + actual_output)
FaithfulnessMetric (needs actual_output + retrieval_context, which your RAG pipeline already produces). We do have context, but it is curated rather than retrieved!
https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py Judging by the prompt, we can end up with a perfectly relevant answer that is completely wrong with respect to the context: a hallucinated but relevant answer
We need faithfulness too. Faithfulness takes context as an input, and that context must be identical to what was used to generate actual_output, so nao will have to attach it the same way it does for a regular conversation
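To make the thread above concrete, here is a minimal, stdlib-only sketch of how an eval case could carry the same context the chat saw. It is not nao's or DeepEval's implementation: `ToolCall`, `EvalCase`, `build_context`, `missing_fields` and the `read` tool name are hypothetical. The only details taken from this PR are that `list` and `search` are excluded from the context, that answer relevancy needs `input` + `actual_output`, and that faithfulness additionally needs the context (field named `retrieval_context` here, as in the comment above).

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Tools whose output is NOT counted as context (this PR excludes `list` and `search`).
EXCLUDED_TOOLS = {"list", "search"}


@dataclass
class ToolCall:  # hypothetical record of one tool call made by the chat
    name: str    # tool name as recorded in the conversation
    output: str  # raw text the tool returned to the chat


@dataclass
class EvalCase:  # hypothetical container mirroring the metric inputs discussed above
    input: str          # the user's question
    actual_output: str  # the chat's answer
    retrieval_context: list[str] = field(default_factory=list)


def build_context(tool_calls: list[ToolCall]) -> list[str]:
    """Keep non-empty outputs of context-fetching tools, in call order."""
    return [c.output for c in tool_calls if c.name not in EXCLUDED_TOOLS and c.output]


def missing_fields(case: EvalCase, metric: str) -> list[str]:
    """Return the fields a metric needs that are empty on this case."""
    required = {
        "answer_relevancy": ["input", "actual_output"],
        "faithfulness": ["input", "actual_output", "retrieval_context"],
    }[metric]
    return [name for name in required if not getattr(case, name)]


# Example: only the output of the (made-up) `read` tool ends up in the context.
calls = [
    ToolCall("list", "orders.sql\ncustomers.sql"),
    ToolCall("read", "orders.amount is stored in cents"),
    ToolCall("search", "3 matches for 'amount'"),
]
case = EvalCase(
    input="What unit is orders.amount in?",
    actual_output="orders.amount is stored in cents.",
    retrieval_context=build_context(calls),
)
```

Building the context from the recorded tool calls, rather than retyping it in the test, is one way to keep it identical to what the chat actually used.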
more thoughts
wording
compare eval framework options
end-to-end
metrics shortlist
Expanded the goal description for the evals framework to include details about existing SQL tests and the addition of non-deterministic evals.
Added section on design choices for evals framework.
Updated metrics section to include RAG triad methodology and clarified evaluation criteria for curated context.
Updated the inputs for the Faithfulness metric to include 'input'.
…e/kwwhat into chat-evals-stabilization
what we learned from running evals: exclude `list` from the context-fetching tools, lower the context relevancy threshold to account for broader curated context
added example output
| Results are written to `{project}/tests/outputs/evals_{timestamp}.json`: | ||
| ```json |
removed search + fixed tests
Clarified the context description for evaluations by removing the word 'faithfulness' and specifying the tools included and excluded.
exclude search
maintenance burden
pivots
less drama
even less drama
Metric strategy: referenceless vs reference-based
reference-based as a choice
curated context explained
RAG example
clarify the difference in constructing
match expected behavior from getnao/nao#651: a `tests/` or `agent/tests/` directory convention where data teams define test cases: input query + expected behavior (correct table referenced, correct metric, no hallucination, etc.). `nao test` runs the suite, reports pass/fail per case, and outputs a summary score. CI-friendly output (exit code, JSON report).
open questions and acceptance criteria
tools
No description provided.
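The commits above describe pass/fail per case, a summary score, a JSON report under `{project}/tests/outputs/evals_{timestamp}.json`, and a CI-friendly exit code. The following is a rough sketch of that flow under stated assumptions, not the code in this PR. The report schema, the threshold values, the timestamp format, and the function names `summarize`, `write_report` and `exit_code` are all hypothetical.

```python
from __future__ import annotations

import json
import time
from pathlib import Path


def summarize(results: list[dict], thresholds: dict[str, float]) -> dict:
    """Mark each case pass/fail and compute the share of passing cases."""
    cases = []
    for r in results:
        # A case passes only if every thresholded metric meets its threshold;
        # a metric missing from the scores counts as 0.0.
        passed = all(r["scores"].get(m, 0.0) >= t for m, t in thresholds.items())
        cases.append({"id": r["id"], "scores": r["scores"], "passed": passed})
    n_passed = sum(c["passed"] for c in cases)
    return {
        "cases": cases,
        "passed": n_passed,
        "total": len(cases),
        "summary_score": n_passed / len(cases) if cases else 0.0,
    }


def write_report(project: Path, report: dict) -> Path:
    """Write the report to {project}/tests/outputs/evals_{timestamp}.json."""
    out_dir = project / "tests" / "outputs"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp format is an assumption; the PR does not specify one here.
    path = out_dir / f"evals_{time.strftime('%Y%m%d_%H%M%S')}.json"
    path.write_text(json.dumps(report, indent=2))
    return path


def exit_code(report: dict) -> int:
    """0 when at least one case ran and all cases passed, 1 otherwise."""
    return 0 if report["total"] > 0 and report["passed"] == report["total"] else 1
```

A CI job could then call `sys.exit(exit_code(report))` after `write_report(...)`, so a failing case fails the build while the JSON file stays available as an artifact.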