From b5336d86bcbd6f4a046dce2d4d4ccefa0c0a3917 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Sat, 27 Jun 2026 12:45:47 +0200
Subject: [PATCH 1/2] docs(eval): clarify eval authoring contracts

---
 .../docs/docs/evaluation/eval-cases.mdx       | 56 ++++++++++++++-----
 .../docs/docs/evaluation/eval-files.mdx       | 16 ++++--
 .../docs/docs/guides/eval-authoring.mdx       | 26 +++++++++
 .../agentv-bench/references/eval-yaml-spec.md | 16 +++++-
 skills-data/agentv-eval-review/SKILL.md       |  5 +-
 skills-data/agentv-eval-writer/SKILL.md       | 50 +++++++++++++++--
 6 files changed, 142 insertions(+), 27 deletions(-)
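
Reviewer note (not part of the commit): the checks this patch adds to `agentv-eval-review/SKILL.md` can be sketched as plain functions over an already-parsed eval. This is a minimal illustration, not AgentV's implementation. The names `review_test`, `has_pinned_repo`, and `RUBRIC_PROSE` are hypothetical, and exact-match duplicate detection is an assumption chosen for brevity.

```python
import re
from typing import List, Optional

# Hypothetical pattern: openers named in the review skill that suggest
# expected_output holds rubric prose rather than a golden answer.
RUBRIC_PROSE = re.compile(r"^\s*the (agent should|output is)\b", re.IGNORECASE)


def review_test(test: dict) -> List[str]:
    """Return warnings for one parsed test entry (assumed to be a dict)."""
    warnings = []

    # Plain strings in `assertions` are rubric criteria; typed entries are dicts.
    assertions = test.get("assertions") or []
    rubric = {a.strip().lower() for a in assertions if isinstance(a, str)}

    # Assumption: "duplicate" means an exact, case-insensitive match.
    criteria = test.get("criteria")
    if isinstance(criteria, str) and criteria.strip().lower() in rubric:
        warnings.append("criteria duplicates an assertion string; remove criteria")

    expected = test.get("expected_output")
    if isinstance(expected, str) and RUBRIC_PROSE.search(expected):
        warnings.append("expected_output reads like a rubric; write a golden answer")

    return warnings


def has_pinned_repo(workspace: Optional[dict]) -> bool:
    """True when any workspace repo pins `commit` or `base_commit`."""
    repos = (workspace or {}).get("repos") or []
    return any(r.get("commit") or r.get("base_commit") for r in repos)
```

A SHA that appears only in `input` or `metadata` leaves `has_pinned_repo` false, which is the case the new warning is meant to catch.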

diff --git a/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx b/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
index e5db2e8f1..8f19727c5 100644
--- a/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
@@ -12,11 +12,11 @@ Tests are individual test entries within an evaluation file. Each test defines i
 ```yaml
 tests:
   - id: addition
-    criteria: Correctly calculates 15 + 27 = 42
-
     input: What is 15 + 27?
 
     expected_output: "42"
+    assertions:
+      - The answer is exactly 42
 ```
 
 ## Fields
@@ -24,7 +24,7 @@ tests:
 | Field | Required | Description |
 |-------|----------|-------------|
 | `id` | Yes | Unique identifier for the test |
-| `criteria` | Yes | Description of what a correct response should contain |
+| `criteria` | Conditional | Description of what a correct response should contain. Required only when the case has neither `expected_output` nor `assertions` |
 | `input` | Yes | Input sent to the target (string, object, or message array) |
 | `expected_output` | No | Expected response for comparison (string, object, or message array) |
 | `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
@@ -67,11 +67,13 @@ When suite-level `input` is defined in the eval file, those messages are prepend
 
 ## Expected Output
 
-Optional reference response for comparison by graders. `expected_output` is passive reference
-data: it is stored on the case and passed to graders, but it does not choose a grader by
-itself when `assertions` is present. Add an explicit `llm-grader`, `code-grader`,
-`field-accuracy`, or another reference-aware grader when you want the reference answer
-evaluated.
+Optional reference response for comparison by graders. Write `expected_output` as
+a golden answer or reference response the target could have produced, not as a
+rubric or "the agent should..." criteria list. `expected_output` is passive
+reference data: it is stored on the case and passed to graders, but it does not
+choose a grader by itself when `assertions` is present. Add explicit assertion
+strings, `llm-grader`, `code-grader`, `field-accuracy`, or another
+reference-aware grader when you want the reference answer evaluated.
 
 A string expands to a single assistant message:
 
@@ -164,7 +166,6 @@ Pass arbitrary key-value pairs to lifecycle commands via the `metadata` field. T
 ```yaml
 tests:
   - id: sympy-20590
-    criteria: Bug should be fixed
     input: Fix the diophantine equation bug in repo/.
     metadata:
       source_repo: sympy/sympy
@@ -182,6 +183,17 @@ tests:
 
 The `metadata` field is included in the stdin JSON passed to lifecycle commands as `case_metadata`.
 Operational checkout state belongs under `workspace.repos[].base_commit`; matching metadata fields such as `source_commit` are informational only.
+For historical repo-state evals, pin the checkout under `workspace.repos[]`
+instead of only mentioning the SHA in prompt prose:
+
+```yaml
+workspace:
+  repos:
+    - path: ./agentv
+      repo: https://github.com/EntityProcess/agentv.git
+      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215
+```
+
 For benchmark task packs with source pins, patches, generated rows, and
 supporting files, see [Benchmark Provenance](/docs/guides/benchmark-provenance/).
 
@@ -197,7 +209,6 @@ AgentV groups the strings into a rubric grader automatically:
 ```yaml
 tests:
   - id: bug-fix-review
-    criteria: Finds and fixes the bug
     input: Review this failing parser implementation.
     assertions:
       - Identifies the root cause of the parser failure
@@ -206,7 +217,10 @@ tests:
 ```
 
 Use this shape for qualitative requirements. It is less brittle than checking
-for exact substrings in an agent response.
+for exact substrings in an agent response. When these strings fully define the
+grading contract, do not add a `criteria` field that repeats the same rubric.
+Declare `type: llm-grader` explicitly only when you need a custom prompt, custom
+grader target, or a deliberately separate grader panel.
 
 ### Deterministic Assertions
 
@@ -232,7 +246,6 @@ Underscore variants (`contains_all`, `is_json`, etc.) are also accepted.
 ```yaml
 tests:
   - id: json-api
-    criteria: Returns valid JSON with status field
     input: Return the system status as JSON
     assertions:
       - type: is-json
@@ -247,14 +260,12 @@ Use `contains-all` or `contains-any` to check multiple values in a single assert
 ```yaml
 tests:
   - id: required-fields
-    criteria: Response mentions all required fields
     input: "Confirm details: name is Alice, email is alice@example.com"
     assertions:
       - type: contains-all
         value: ["Alice", "alice@example.com"]
 
   - id: greeting-variant
-    criteria: Response includes some form of greeting
     input: "Greet the user warmly."
     assertions:
       - type: contains-any
@@ -411,6 +422,23 @@ tests:
         value: "4"
 ```
 
+For contract-style evals where assertion strings express every semantic check,
+omit `criteria`:
+
+```yaml
+tests:
+  - id: verification-learning-capture
+    input: |
+      Decide what durable repo change should be made after a PR closeout
+      revealed reusable verification workflow lessons.
+    expected_output: |
+      The durable repo change is to update .agents/verification.md with the
+      reusable verification workflow lessons.
+    assertions:
+      - The answer recommends updating .agents/verification.md rather than leaving the learning only in PR comments or private evidence.
+      - The answer avoids preserving one-off observations as durable guidance.
+```
+
 If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
 
 ```
diff --git a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx
index 84a8bf437..3d7cb3aad 100644
--- a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx
@@ -23,13 +23,11 @@ experiment:
   target: default
 
 assertions:
-  - name: correctness
-    type: llm-grader
-    prompt: ./graders/correctness.md
+  - Correctly calculates the answer
+  - Explains the calculation briefly
 
 tests:
   - id: addition
-    criteria: Correctly calculates 15 + 27 = 42
     input: What is 15 + 27?
     expected_output: "42"
 ```
@@ -46,6 +44,11 @@ tests:
 | `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
 | `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |
 
+For historical or repo-state evals, put the checkout under
+`workspace.repos[].commit` or `workspace.repos[].base_commit`. A commit SHA in
+the prompt or metadata is useful context, but it does not materialize a repo for
+the agent to inspect.
+
 ### Metadata Fields
 
 You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the `name` field is present:
@@ -82,6 +85,10 @@ The `assertions` field is the canonical way to define suite-level graders. Suite
 For semantic or agent-behavior checks, prefer plain assertion strings first;
 AgentV treats them as rubric criteria. Use deterministic assertions or code
 graders when the expected output is exact or requires programmatic inspection.
+If the assertion strings already state the grading contract, omit a duplicate
+`criteria` field on each test. Use explicit `type: llm-grader` entries only
+when you need a custom prompt, a custom grader target, or a deliberately
+separate grader panel.
 
 ```yaml
 description: API response validation
@@ -95,7 +102,6 @@ assertions:
 
 tests:
   - id: health-check
-    criteria: Returns health status
     input: Check API health
 ```
 
diff --git a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx b/apps/web/src/content/docs/docs/guides/eval-authoring.mdx
index bdee29168..6dda5efbf 100644
--- a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx
+++ b/apps/web/src/content/docs/docs/guides/eval-authoring.mdx
@@ -150,3 +150,29 @@ When you don't want to maintain actual diffs, describe the changes inline:
 ```
 
 This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`.
+
+## Historical Repo State: Pin the Checkout
+
+If a test asks the agent to inspect how a repository looked at a past commit,
+declare that checkout in `workspace.repos[]`. Do not rely on prompt prose that
+mentions a SHA without materializing the repo.
+
+```yaml
+workspace:
+  repos:
+    - path: ./agentv
+      repo: https://github.com/EntityProcess/agentv.git
+      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215
+
+tests:
+  - id: verification-learning-capture
+    input: |
+      The eval harness has prepared ./agentv at the historical commit.
+      Use that checkout to decide which durable guidance should change.
+    expected_output: |
+      The durable repo change is to update .agents/verification.md with the
+      reusable verification workflow lessons.
+    assertions:
+      - The answer uses the pinned ./agentv checkout to verify the existing guidance.
+      - The answer preserves the historical commit SHA as context.
+```
diff --git a/skills-data/agentv-bench/references/eval-yaml-spec.md b/skills-data/agentv-bench/references/eval-yaml-spec.md
index db71f1d88..b2285993a 100644
--- a/skills-data/agentv-bench/references/eval-yaml-spec.md
+++ b/skills-data/agentv-bench/references/eval-yaml-spec.md
@@ -10,7 +10,7 @@ The grader agent uses this to evaluate assertions without the CLI.
 - `name` (string, optional) — eval name
 - `description` (string, optional) — description
 - `execution` (object, optional) — `target`, `model`, etc.
-- `workspace` (object, optional) — workspace config (template, hooks)
+- `workspace` (object, optional) — workspace config (template, repos, hooks)
 - `input` (string | object | Message | Message[], optional) — suite-level input prepended to each test. String/block shorthand expands to a user message.
 - `tests` (array, required) — test cases
 
@@ -24,6 +24,18 @@ The grader agent uses this to evaluate assertions without the CLI.
 - `conversation_id` (string, optional) — groups related tests
 - `execution` (object, optional) — per-test execution override
 
+If `assertions` already state the grading contract, omit `criteria` instead of
+duplicating the same rubric. Prefer plain assertion strings for semantic checks
+when the default LLM rubric grader can judge them; use multiple named
+`type: llm-grader` blocks only for custom prompts, custom grader targets, or
+intentional grader panels. Write `expected_output` as a golden/reference answer,
+not as criteria or scoring instructions.
+
+For historical or repo-state evals, materialize the repository under
+`workspace.repos[]` and pin `commit` or `base_commit` to the commit under test.
+A SHA in prompt prose or metadata is context only; it does not give the agent an
+actual checkout.
+
 ## 2. Assertion Types and Grading Recipes
 
 ### Default grader contract
@@ -35,6 +47,8 @@ When `assertions` is present, the list is explicit: run only the declared
 assertions/graders. `expected_output` remains reference data for graders that consume it,
 such as `llm-grader`, `code-grader`, or `field-accuracy`; it does not trigger an additional
 default `llm-grader`.
+When the declared assertion strings fully express the semantic contract, do not
+also add a duplicate `criteria` block.
 
 For each assertion type: YAML config fields, grading recipe (exact pseudocode for deterministic types), and PASS/FAIL conditions.
 
diff --git a/skills-data/agentv-eval-review/SKILL.md b/skills-data/agentv-eval-review/SKILL.md
index 6b1752b10..99c612dee 100644
--- a/skills-data/agentv-eval-review/SKILL.md
+++ b/skills-data/agentv-eval-review/SKILL.md
@@ -24,7 +24,10 @@ Walk every target eval file and report violations grouped by severity (error > w
 - Each entry under `tests` has `id`, `input`, and at least one of `criteria` / `expected_output` / `assertions` (error if missing).
 - File-typed inputs (`type: file`) use a leading `/` in their `path` (error if relative).
 - Tests have an `assertions` block — flag tests that rely solely on `expected_output` (warning).
-- Detect `expected_output` prose patterns like "The agent should…" or "The output is…" (warning — prose belongs in `criteria`, structured matches in `assertions`).
+- Flag `criteria` that duplicates assertion strings when `assertions` already express the grading contract (warning — remove the duplicate `criteria`).
+- Prefer plain assertion strings over multiple named `type: llm-grader` blocks when the default LLM rubric grader can evaluate the checks (info unless custom prompts or grader targets are present).
+- Detect `expected_output` prose patterns like "The agent should..." or "The output is..." (warning — `expected_output` should be a golden/reference answer; scoring rules belong in `assertions` or, for implicit-grader cases, `criteria`).
+- For historical or repo-state evals, verify the relevant repo is pinned under `workspace.repos[].commit` or `workspace.repos[].base_commit`; a SHA mentioned only in prompt prose or metadata is not an operational checkout (warning).
 - Identical file inputs repeated across multiple tests in the same eval should be hoisted to a top-level `input` (info).
 - Eval files in the same directory should share a common `id` prefix (info — flag drift).
 
diff --git a/skills-data/agentv-eval-writer/SKILL.md b/skills-data/agentv-eval-writer/SKILL.md
index eaa2a19b8..482496c43 100644
--- a/skills-data/agentv-eval-writer/SKILL.md
+++ b/skills-data/agentv-eval-writer/SKILL.md
@@ -30,6 +30,13 @@ mutation under the parent `experiment:`.
 
 Use `@agentv/sdk` for TypeScript helper imports. Do not use `@agentv/eval` for new evals, examples, scaffolds, or skill guidance; it was a deprecated compatibility package and has been removed from this repository.
 
+## Authoring Checklist
+
+- If `assertions` already state the grading contract, omit `criteria` instead of duplicating the same rubric.
+- Prefer plain assertion strings for semantic checks when the default LLM rubric grader can judge them. Use multiple named `type: llm-grader` blocks only for custom prompts, custom grader targets, or intentionally separate grader panels.
+- Write `expected_output` as a golden/reference answer the target could have produced. Do not write criteria, scoring instructions, or "the agent should..." rubric prose there.
+- For historical or repo-state evals, materialize the repo under `workspace.repos[]` pinned to the commit under test. Mentioning a SHA only in prompt prose is not enough because the agent needs an actual checkout to inspect.
+
 ## Evaluation Types
 
 AgentV evaluations measure **execution quality** — whether your agent or skill produces correct output when invoked.
@@ -82,7 +89,6 @@ JSON messages:
 ```yaml
 tests:
   - id: multi-turn-context
-    criteria: "Agent remembers prior context"
     input:
       - role: user
         content: "My name is Alice"
@@ -106,7 +112,6 @@ experiment:
 
 tests:
   - id: greeting
-    criteria: Friendly greeting
     input: "Say hello"
     expected_output: "Hello! How can I help you?"
     assertions:
@@ -124,7 +129,7 @@ tests:
 | Field | Required | Description |
 |-------|----------|-------------|
 | `id` | yes | Unique identifier |
-| `criteria` | yes | What the response should accomplish |
+| `criteria` | conditional | What the response should accomplish; required only when neither `expected_output` nor `assertions` is present |
 | `input` | yes | Input to the agent (string/object shorthand or full message array) |
 | `expected_output` | no | Gold-standard reference answer (string shorthand or full message array) |
 | `assertions` | no | Graders: deterministic checks, rubrics, and LLM/code graders |
@@ -212,7 +217,6 @@ assertions:
 
 tests:
   - id: test-1
-    criteria: Returns a useful status payload
     input: Get status
     assertions:
       - type: equals
@@ -224,6 +228,35 @@ Plain strings in `assertions` are rubric criteria and are the preferred shape fo
 qualitative agent behavior. Use deterministic assertions (`contains`, `regex`,
 `is-json`, `equals`) only for exact machine-verifiable outputs, and code graders
 when the check must inspect files, run commands, or validate structured state.
+Do not add a separate `criteria` field that just repeats these assertion strings.
+
+For repo-state evals, combine a pinned checkout, a golden answer, and assertion
+shorthand:
+
+```yaml
+workspace:
+  repos:
+    - path: ./agentv
+      repo: https://github.com/EntityProcess/agentv.git
+      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215
+
+tests:
+  - id: verification-learning-capture
+    input: |
+      The eval harness has prepared ./agentv at the commit before the
+      verification guidance was added.
+
+      Decide what durable repo change should be made and explain why.
+    expected_output: |
+      The durable repo change is to update .agents/verification.md with the
+      reusable verification workflow lessons. AGENTS.md already routes this
+      class of work to .agents/verification.md, so no extra AGENTS.md edit is
+      needed unless that routing is missing.
+    assertions:
+      - The answer recommends updating .agents/verification.md rather than leaving the learning only in PR comments or private evidence.
+      - The answer uses the pinned ./agentv checkout to verify the AGENTS.md routing.
+      - The answer preserves the historical commit SHA as context.
+```
 
 ## How `criteria` and `assertions` Interact
 
@@ -256,7 +289,6 @@ target, declare `llm-grader` explicitly:
 ```yaml
 tests:
   - id: mixed-eval
-    criteria: Response is helpful and mentions the fix
     input: "Debug this function..."
     assertions:
       - Explains why the bug happens
@@ -277,6 +309,10 @@ tests:
     # Warning: criteria is defined but no grader in assertions will evaluate it.
 ```
 
+If plain assertion strings fully express the semantic contract, leave `criteria`
+out. Keep `criteria` for the implicit-grader path or for non-duplicative context
+that a declared grader actually needs.
+
 ## Required Gates
 
 Any grader can be marked `required` to enforce a minimum score:
@@ -319,7 +355,6 @@ workspace:
 tests:
   - id: case-1
     input: Fix the bug
-    criteria: Bug is fixed
     metadata:
       source_repo: sympy/sympy
       source_commit: "abc123"
@@ -330,6 +365,9 @@ tests:
 **Commands receive stdin JSON:** `{workspace_path, test_id, eval_run_id, case_input, case_metadata}`
 **Setup failure:** aborts case. **Teardown failure:** non-fatal (warning).
 For SWE-bench-style evals, keep operational checkout state under `workspace.repos[].base_commit`; treat `metadata.source_commit` as informational only.
+For historical repo-state evals, pin `workspace.repos[].commit` or
+`workspace.repos[].base_commit` to the commit under test. A SHA in the prompt or
+metadata without a matching workspace repo pin is not an operational checkout.
 
 ### Repository Lifecycle
 

From a0c90cc58fd81afc28dad108632b272974f08559 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Sat, 27 Jun 2026 12:57:27 +0200
Subject: [PATCH 2/2] docs(eval): align validate field requirements

---
 apps/web/src/content/docs/docs/tools/validate.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
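
Reviewer note (not part of the commit): the required-fields rule as reworded in this patch can be read as the check below. It is a minimal sketch under stated assumptions, not the actual `agentv validate` code. `missing_required` and `GRADING_FIELDS` are hypothetical names, and treating `None` or an empty value as absent is an assumption.

```python
from typing import List

# Fields that can carry the grading contract, per the reworded docs line.
GRADING_FIELDS = ("criteria", "expected_output", "assertions", "turns")


def _present(test: dict, name: str) -> bool:
    # Assumption: None, empty strings, and empty lists count as missing.
    return test.get(name) not in (None, "", [])


def missing_required(test: dict) -> List[str]:
    """Return human-readable problems for one parsed test entry."""
    problems = [f"missing `{name}`" for name in ("id", "input")
                if not _present(test, name)]
    if not any(_present(test, name) for name in GRADING_FIELDS):
        problems.append("needs at least one of: " + ", ".join(GRADING_FIELDS))
    return problems
```

Under this reading, a test with only `id`, `input`, and `expected_output` passes, which matches the examples in patch 1 that drop `criteria`.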

diff --git a/apps/web/src/content/docs/docs/tools/validate.mdx b/apps/web/src/content/docs/docs/tools/validate.mdx
index f8a4f32d8..c1ad27644 100644
--- a/apps/web/src/content/docs/docs/tools/validate.mdx
+++ b/apps/web/src/content/docs/docs/tools/validate.mdx
@@ -22,7 +22,7 @@ agentv validate evals/**/*.yaml
 ## What It Checks
 
 - YAML/JSONL syntax
-- Required fields (id, input, criteria)
+- Required fields: `id`, `input`, and at least one of `criteria`, `expected_output`, `assertions`, or `turns`
 - Grader references (command paths, prompt files)
 - Target references match entries in `targets.yaml`
 - Rubric structure and field types