From b5336d86bcbd6f4a046dce2d4d4ccefa0c0a3917 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 27 Jun 2026 12:45:47 +0200 Subject: [PATCH 1/2] docs(eval): clarify eval authoring contracts --- .../docs/docs/evaluation/eval-cases.mdx | 56 ++++++++++++++----- .../docs/docs/evaluation/eval-files.mdx | 16 ++++-- .../docs/docs/guides/eval-authoring.mdx | 26 +++++++++ .../agentv-bench/references/eval-yaml-spec.md | 16 +++++- skills-data/agentv-eval-review/SKILL.md | 5 +- skills-data/agentv-eval-writer/SKILL.md | 50 +++++++++++++++-- 6 files changed, 142 insertions(+), 27 deletions(-) diff --git a/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx b/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx index e5db2e8f1..8f19727c5 100644 --- a/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx +++ b/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx @@ -12,11 +12,11 @@ Tests are individual test entries within an evaluation file. Each test defines i ```yaml tests: - id: addition - criteria: Correctly calculates 15 + 27 = 42 - input: What is 15 + 27? expected_output: "42" + assertions: + - The answer is exactly 42 ``` ## Fields @@ -24,7 +24,7 @@ tests: | Field | Required | Description | |-------|----------|-------------| | `id` | Yes | Unique identifier for the test | -| `criteria` | Yes | Description of what a correct response should contain | +| `criteria` | Conditional | Description of what a correct response should contain. Required only when the case has no `expected_output` or `assertions` | | `input` | Yes | Input sent to the target (string, object, or message array) | | `expected_output` | No | Expected response for comparison (string, object, or message array) | | `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) | @@ -67,11 +67,13 @@ When suite-level `input` is defined in the eval file, those messages are prepend ## Expected Output -Optional reference response for comparison by graders. 
`expected_output` is passive reference -data: it is stored on the case and passed to graders, but it does not choose a grader by -itself when `assertions` is present. Add an explicit `llm-grader`, `code-grader`, -`field-accuracy`, or another reference-aware grader when you want the reference answer -evaluated. +Optional reference response for comparison by graders. Write `expected_output` as +a golden answer or reference response the target could have produced, not as a +rubric or "the agent should..." criteria list. `expected_output` is passive +reference data: it is stored on the case and passed to graders, but it does not +choose a grader by itself when `assertions` is present. Add explicit assertion +strings, `llm-grader`, `code-grader`, `field-accuracy`, or another +reference-aware grader when you want the reference answer evaluated. A string expands to a single assistant message: @@ -164,7 +166,6 @@ Pass arbitrary key-value pairs to lifecycle commands via the `metadata` field. T ```yaml tests: - id: sympy-20590 - criteria: Bug should be fixed input: Fix the diophantine equation bug in repo/. metadata: source_repo: sympy/sympy @@ -182,6 +183,17 @@ tests: The `metadata` field is included in the stdin JSON passed to lifecycle commands as `case_metadata`. Operational checkout state belongs under `workspace.repos[].base_commit`; matching metadata fields such as `source_commit` are informational only. +For historical repo-state evals, pin the checkout under `workspace.repos[]` +instead of only mentioning the SHA in prompt prose: + +```yaml +workspace: + repos: + - path: ./agentv + repo: https://github.com/EntityProcess/agentv.git + commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215 +``` + For benchmark task packs with source pins, patches, generated rows, and supporting files, see [Benchmark Provenance](/docs/guides/benchmark-provenance/). 
@@ -197,7 +209,6 @@ AgentV groups the strings into a rubric grader automatically: ```yaml tests: - id: bug-fix-review - criteria: Finds and fixes the bug input: Review this failing parser implementation. assertions: - Identifies the root cause of the parser failure @@ -206,7 +217,10 @@ tests: ``` Use this shape for qualitative requirements. It is less brittle than checking -for exact substrings in an agent response. +for exact substrings in an agent response. When these strings fully define the +grading contract, do not add a `criteria` field that repeats the same rubric. +Declare `type: llm-grader` explicitly only when you need a custom prompt, custom +grader target, or a deliberately separate grader panel. ### Deterministic Assertions @@ -232,7 +246,6 @@ Underscore variants (`contains_all`, `is_json`, etc.) are also accepted. ```yaml tests: - id: json-api - criteria: Returns valid JSON with status field input: Return the system status as JSON assertions: - type: is-json @@ -247,14 +260,12 @@ Use `contains-all` or `contains-any` to check multiple values in a single assert ```yaml tests: - id: required-fields - criteria: Response mentions all required fields input: "Confirm details: name is Alice, email is alice@example.com" assertions: - type: contains-all value: ["Alice", "alice@example.com"] - id: greeting-variant - criteria: Response includes some form of greeting input: "Greet the user warmly." assertions: - type: contains-any @@ -411,6 +422,23 @@ tests: value: "4" ``` +For contract-style evals where assertion strings express every semantic check, +omit `criteria`: + +```yaml +tests: + - id: verification-learning-capture + input: | + Decide what durable repo change should be made after a PR closeout + revealed reusable verification workflow lessons. + expected_output: | + The durable repo change is to update .agents/verification.md with the + reusable verification workflow lessons. 
+ assertions: + - The answer recommends updating .agents/verification.md rather than leaving the learning only in PR comments or private evidence. + - The answer avoids preserving one-off observations as durable guidance. +``` + If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted: ``` diff --git a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx index 84a8bf437..3d7cb3aad 100644 --- a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx +++ b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx @@ -23,13 +23,11 @@ experiment: target: default assertions: - - name: correctness - type: llm-grader - prompt: ./graders/correctness.md + - Correctly calculates the answer + - Explains the calculation briefly tests: - id: addition - criteria: Correctly calculates 15 + 27 = 42 input: What is 15 + 27? expected_output: "42" ``` @@ -46,6 +44,11 @@ tests: | `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test | | `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test | +For historical or repo-state evals, put the checkout under +`workspace.repos[].commit` or `workspace.repos[].base_commit`. A commit SHA in +the prompt or metadata is useful context, but it does not materialize a repo for +the agent to inspect. + ### Metadata Fields You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the `name` field is present: @@ -82,6 +85,10 @@ The `assertions` field is the canonical way to define suite-level graders. Suite For semantic or agent-behavior checks, prefer plain assertion strings first; AgentV treats them as rubric criteria. 
Use deterministic assertions or code graders when the expected output is exact or requires programmatic inspection. +If the assertion strings already state the grading contract, omit a duplicate +`criteria` field on each test. Use explicit `type: llm-grader` entries only +when you need a custom prompt, a custom grader target, or a deliberately +separate grader panel. ```yaml description: API response validation @@ -95,7 +102,6 @@ assertions: tests: - id: health-check - criteria: Returns health status input: Check API health ``` diff --git a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx b/apps/web/src/content/docs/docs/guides/eval-authoring.mdx index bdee29168..6dda5efbf 100644 --- a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx +++ b/apps/web/src/content/docs/docs/guides/eval-authoring.mdx @@ -150,3 +150,29 @@ When you don't want to maintain actual diffs, describe the changes inline: ``` This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`. + +## Historical Repo State: Pin the Checkout + +If a test asks the agent to inspect how a repository looked at a past commit, +declare that checkout in `workspace.repos[]`. Do not rely on prompt prose that +mentions a SHA without materializing the repo. + +```yaml +workspace: + repos: + - path: ./agentv + repo: https://github.com/EntityProcess/agentv.git + commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215 + +tests: + - id: verification-learning-capture + input: | + The eval harness has prepared ./agentv at the historical commit. + Use that checkout to decide which durable guidance should change. + expected_output: | + The durable repo change is to update .agents/verification.md with the + reusable verification workflow lessons. + assertions: + - The answer uses the pinned ./agentv checkout to verify the existing guidance. + - The answer preserves the historical commit SHA as context. 
+``` diff --git a/skills-data/agentv-bench/references/eval-yaml-spec.md b/skills-data/agentv-bench/references/eval-yaml-spec.md index db71f1d88..b2285993a 100644 --- a/skills-data/agentv-bench/references/eval-yaml-spec.md +++ b/skills-data/agentv-bench/references/eval-yaml-spec.md @@ -10,7 +10,7 @@ The grader agent uses this to evaluate assertions without the CLI. - `name` (string, optional) — eval name - `description` (string, optional) — description - `execution` (object, optional) — `target`, `model`, etc. -- `workspace` (object, optional) — workspace config (template, hooks) +- `workspace` (object, optional) — workspace config (template, repos, hooks) - `input` (string | object | Message | Message[], optional) — suite-level input prepended to each test. String/block shorthand expands to a user message. - `tests` (array, required) — test cases @@ -24,6 +24,18 @@ The grader agent uses this to evaluate assertions without the CLI. - `conversation_id` (string, optional) — groups related tests - `execution` (object, optional) — per-test execution override +If `assertions` already state the grading contract, omit `criteria` instead of +duplicating the same rubric. Prefer plain assertion strings for semantic checks +when the default LLM rubric grader can judge them; use multiple named +`type: llm-grader` blocks only for custom prompts, custom grader targets, or +intentional grader panels. Write `expected_output` as a golden/reference answer, +not as criteria or scoring instructions. + +For historical or repo-state evals, materialize the repository under +`workspace.repos[]` and pin `commit` or `base_commit` to the commit under test. +A SHA in prompt prose or metadata is context only; it does not give the agent an +actual checkout. + ## 2. Assertion Types and Grading Recipes ### Default grader contract @@ -35,6 +47,8 @@ When `assertions` is present, the list is explicit: run only the declared assertions/graders. 
`expected_output` remains reference data for graders that consume it, such as `llm-grader`, `code-grader`, or `field-accuracy`; it does not trigger an additional default `llm-grader`. +When the declared assertion strings fully express the semantic contract, do not +also add a duplicate `criteria` block. For each assertion type: YAML config fields, grading recipe (exact pseudocode for deterministic types), and PASS/FAIL conditions. diff --git a/skills-data/agentv-eval-review/SKILL.md b/skills-data/agentv-eval-review/SKILL.md index 6b1752b10..99c612dee 100644 --- a/skills-data/agentv-eval-review/SKILL.md +++ b/skills-data/agentv-eval-review/SKILL.md @@ -24,7 +24,10 @@ Walk every target eval file and report violations grouped by severity (error > w - Each entry under `tests` has `id`, `input`, and at least one of `criteria` / `expected_output` / `assertions` (error if missing). - File-typed inputs (`type: file`) use a leading `/` in their `path` (error if relative). - Tests have an `assertions` block — flag tests that rely solely on `expected_output` (warning). -- Detect `expected_output` prose patterns like "The agent should…" or "The output is…" (warning — prose belongs in `criteria`, structured matches in `assertions`). +- Flag `criteria` that duplicates assertion strings when `assertions` already express the grading contract (warning — remove the duplicate `criteria`). +- Prefer plain assertion strings over multiple named `type: llm-grader` blocks when the default LLM rubric grader can evaluate the checks (info unless custom prompts or grader targets are present). +- Detect `expected_output` prose patterns like "The agent should..." or "The output is..." (warning — `expected_output` should be a golden/reference answer; scoring rules belong in `assertions` or, for implicit-grader cases, `criteria`). 
+- For historical or repo-state evals, verify the relevant repo is pinned under `workspace.repos[].commit` or `workspace.repos[].base_commit`; a SHA mentioned only in prompt prose or metadata is not an operational checkout (warning). - Identical file inputs repeated across multiple tests in the same eval should be hoisted to a top-level `input` (info). - Eval files in the same directory should share a common `id` prefix (info — flag drift). diff --git a/skills-data/agentv-eval-writer/SKILL.md b/skills-data/agentv-eval-writer/SKILL.md index eaa2a19b8..482496c43 100644 --- a/skills-data/agentv-eval-writer/SKILL.md +++ b/skills-data/agentv-eval-writer/SKILL.md @@ -30,6 +30,13 @@ mutation under the parent `experiment:`. Use `@agentv/sdk` for TypeScript helper imports. Do not use `@agentv/eval` for new evals, examples, scaffolds, or skill guidance; it was a deprecated compatibility package and has been removed from this repository. +## Authoring Checklist + +- If `assertions` already state the grading contract, omit `criteria` instead of duplicating the same rubric. +- Prefer plain assertion strings for semantic checks when the default LLM rubric grader can judge them. Use multiple named `type: llm-grader` blocks only for custom prompts, custom grader targets, or intentionally separate grader panels. +- Write `expected_output` as a golden/reference answer the target could have produced. Do not write criteria, scoring instructions, or "the agent should..." rubric prose there. +- For historical or repo-state evals, materialize the repo under `workspace.repos[]` pinned to the commit under test. Mentioning a SHA only in prompt prose is not enough because the agent needs an actual checkout to inspect. + ## Evaluation Types AgentV evaluations measure **execution quality** — whether your agent or skill produces correct output when invoked. 
@@ -82,7 +89,6 @@ JSON messages: ```yaml tests: - id: multi-turn-context - criteria: "Agent remembers prior context" input: - role: user content: "My name is Alice" @@ -106,7 +112,6 @@ experiment: tests: - id: greeting - criteria: Friendly greeting input: "Say hello" expected_output: "Hello! How can I help you?" assertions: @@ -124,7 +129,7 @@ tests: | Field | Required | Description | |-------|----------|-------------| | `id` | yes | Unique identifier | -| `criteria` | yes | What the response should accomplish | +| `criteria` | conditional | What the response should accomplish; required only when no `expected_output` or `assertions` are present | | `input` | yes | Input to the agent (string/object shorthand or full message array) | | `expected_output` | no | Gold-standard reference answer (string shorthand or full message array) | | `assertions` | no | Graders: deterministic checks, rubrics, and LLM/code graders | @@ -212,7 +217,6 @@ assertions: tests: - id: test-1 - criteria: Returns a useful status payload input: Get status assertions: - type: equals @@ -224,6 +228,35 @@ Plain strings in `assertions` are rubric criteria and are the preferred shape fo qualitative agent behavior. Use deterministic assertions (`contains`, `regex`, `is-json`, `equals`) only for exact machine-verifiable outputs, and code graders when the check must inspect files, run commands, or validate structured state. +Do not add a separate `criteria` field that just repeats these assertion strings. + +For repo-state evals, combine a pinned checkout, a golden answer, and assertion +shorthand: + +```yaml +workspace: + repos: + - path: ./agentv + repo: https://github.com/EntityProcess/agentv.git + commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215 + +tests: + - id: verification-learning-capture + input: | + The eval harness has prepared ./agentv at the commit before the + verification guidance was added. + + Decide what durable repo change should be made and explain why. 
+ expected_output: | + The durable repo change is to update .agents/verification.md with the + reusable verification workflow lessons. AGENTS.md already routes this + class of work to .agents/verification.md, so no extra AGENTS.md edit is + needed unless that routing is missing. + assertions: + - The answer recommends updating .agents/verification.md rather than leaving the learning only in PR comments or private evidence. + - The answer uses the pinned ./agentv checkout to verify the AGENTS.md routing. + - The answer preserves the historical commit SHA as context. +``` ## How `criteria` and `assertions` Interact @@ -256,7 +289,6 @@ target, declare `llm-grader` explicitly: ```yaml tests: - id: mixed-eval - criteria: Response is helpful and mentions the fix input: "Debug this function..." assertions: - Explains why the bug happens @@ -277,6 +309,10 @@ tests: # Warning: criteria is defined but no grader in assertions will evaluate it. ``` +If plain assertion strings fully express the semantic contract, leave `criteria` +out. Keep `criteria` for the implicit-grader path or for non-duplicative context +that a declared grader actually needs. + ## Required Gates Any grader can be marked `required` to enforce a minimum score: @@ -319,7 +355,6 @@ workspace: tests: - id: case-1 input: Fix the bug - criteria: Bug is fixed metadata: source_repo: sympy/sympy source_commit: "abc123" @@ -330,6 +365,9 @@ tests: **Commands receive stdin JSON:** `{workspace_path, test_id, eval_run_id, case_input, case_metadata}` **Setup failure:** aborts case. **Teardown failure:** non-fatal (warning). For SWE-bench-style evals, keep operational checkout state under `workspace.repos[].base_commit`; treat `metadata.source_commit` as informational only. +For historical repo-state evals, pin `workspace.repos[].commit` or +`workspace.repos[].base_commit` to the commit under test. A SHA in the prompt or +metadata without a matching workspace repo pin is not an operational checkout. 
### Repository Lifecycle From a0c90cc58fd81afc28dad108632b272974f08559 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 27 Jun 2026 12:57:27 +0200 Subject: [PATCH 2/2] docs(eval): align validate field requirements --- apps/web/src/content/docs/docs/tools/validate.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/apps/web/src/content/docs/docs/tools/validate.mdx b/apps/web/src/content/docs/docs/tools/validate.mdx index f8a4f32d8..c1ad27644 100644 --- a/apps/web/src/content/docs/docs/tools/validate.mdx +++ b/apps/web/src/content/docs/docs/tools/validate.mdx @@ -22,7 +22,7 @@ agentv validate evals/**/*.yaml ## What It Checks - YAML/JSONL syntax -- Required fields (id, input, criteria) +- Required fields: `id`, `input`, and at least one of `criteria`, `expected_output`, `assertions`, or `turns` - Grader references (command paths, prompt files) - Target references match entries in `targets.yaml` - Rubric structure and field types
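The required-field rule stated in the `validate.mdx` hunk (`id`, `input`, and at least one of `criteria`, `expected_output`, `assertions`, or `turns`) can be sketched as a small check. This is only an illustration of the rule as worded in the patch, not the `agentv validate` implementation: the function name `missing_required_fields`, the constant `GRADING_FIELDS`, and the choice to treat empty strings and empty lists as missing are all assumptions made for the example.

```python
# Illustrative sketch only: NOT the agentv validator. All names here are hypothetical.
from typing import Any, List, Mapping

# Fields of which at least one must be present, per the validate.mdx wording.
GRADING_FIELDS = ("criteria", "expected_output", "assertions", "turns")

# Assumption: a value that is None, an empty string, or an empty list counts as missing.
_EMPTY = (None, "", [])


def missing_required_fields(test: Mapping[str, Any]) -> List[str]:
    """Return a list of problems for one test entry (empty list if none)."""
    problems = []
    # `id` and `input` are always required.
    for field in ("id", "input"):
        if test.get(field) in _EMPTY:
            problems.append("missing required field: " + field)
    # At least one grading-related field must be present.
    if all(test.get(field) in _EMPTY for field in GRADING_FIELDS):
        problems.append("needs at least one of: " + ", ".join(GRADING_FIELDS))
    return problems
```

Under these assumptions, a test with only `id`, `input`, and `assertions` passes, while a test with only `id` and `input` is reported as lacking a grading field.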