56 changes: 42 additions & 14 deletions apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
@@ -12,19 +12,19 @@ Tests are individual test entries within an evaluation file. Each test defines i
```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42

    input: What is 15 + 27?

    expected_output: "42"
    assertions:
      - The answer is exactly 42
```

## Fields

| Field | Required | Description |
|-------|----------|-------------|
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `criteria` | Conditional | Description of what a correct response should contain. Required only when the case has no `expected_output` or `assertions` |
| `input` | Yes | Input sent to the target (string, object, or message array) |
| `expected_output` | No | Expected response for comparison (string, object, or message array) |
| `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
@@ -67,11 +67,13 @@ When suite-level `input` is defined in the eval file, those messages are prepend

## Expected Output

Optional reference response for comparison by graders. `expected_output` is passive reference
data: it is stored on the case and passed to graders, but it does not choose a grader by
itself when `assertions` is present. Add an explicit `llm-grader`, `code-grader`,
`field-accuracy`, or another reference-aware grader when you want the reference answer
evaluated.
Optional reference response for comparison by graders. Write `expected_output` as
a golden answer or reference response the target could have produced, not as a
rubric or "the agent should..." criteria list. `expected_output` is passive
reference data: it is stored on the case and passed to graders, but it does not
choose a grader by itself when `assertions` is present. Add explicit assertion
strings, `llm-grader`, `code-grader`, `field-accuracy`, or another
reference-aware grader when you want the reference answer evaluated.
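
As an illustration of the distinction above, here is a minimal sketch that contrasts a golden answer with rubric-style prose. The test `id`, the question, and the answer text are hypothetical and invented for this example; only the field names come from the documentation in this diff.

```yaml
tests:
  - id: refund-window            # hypothetical test id
    input: How long do customers have to request a refund?
    # Golden answer: a response the target could plausibly have produced.
    expected_output: Customers can request a refund within 30 days of purchase.
    assertions:
      # The assertion string carries the scoring rule.
      - States that the refund window is 30 days
    # Avoid rubric prose in expected_output, for example:
    # expected_output: The agent should mention the refund window.
```

The reference answer stays readable as something the target might say, while the scoring rule lives in `assertions`.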

A string expands to a single assistant message:

@@ -164,7 +166,6 @@ Pass arbitrary key-value pairs to lifecycle commands via the `metadata` field. T
```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug in repo/.
    metadata:
      source_repo: sympy/sympy
@@ -182,6 +183,17 @@ tests:

The `metadata` field is included in the stdin JSON passed to lifecycle commands as `case_metadata`.
Operational checkout state belongs under `workspace.repos[].base_commit`; matching metadata fields such as `source_commit` are informational only.
For historical repo-state evals, pin the checkout under `workspace.repos[]`
instead of only mentioning the SHA in prompt prose:

```yaml
workspace:
  repos:
    - path: ./agentv
      repo: https://github.com/EntityProcess/agentv.git
      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215
```

For benchmark task packs with source pins, patches, generated rows, and
supporting files, see [Benchmark Provenance](/docs/guides/benchmark-provenance/).

@@ -197,7 +209,6 @@ AgentV groups the strings into a rubric grader automatically:
```yaml
tests:
  - id: bug-fix-review
    criteria: Finds and fixes the bug
    input: Review this failing parser implementation.
    assertions:
      - Identifies the root cause of the parser failure
@@ -206,7 +217,10 @@ tests:
```

Use this shape for qualitative requirements. It is less brittle than checking
for exact substrings in an agent response.
for exact substrings in an agent response. When these strings fully define the
grading contract, do not add a `criteria` field that repeats the same rubric.
Declare `type: llm-grader` explicitly only when you need a custom prompt, custom
grader target, or a deliberately separate grader panel.
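
A sketch of how the two shapes might sit side by side, assuming plain strings and explicit grader entries can share one `assertions` list (the `mixed-eval` example elsewhere in this PR suggests they can). The grader `name` and the `prompt` path are hypothetical; the `name`/`type`/`prompt` keys follow the `llm-grader` entry shown in `eval-files.mdx`.

```yaml
tests:
  - id: bug-fix-review
    input: Review this failing parser implementation.
    assertions:
      # Plain string: grouped into the default rubric grader.
      - Identifies the root cause of the parser failure
      # Explicit grader: declared only because a custom prompt is needed.
      - name: style-review                  # hypothetical grader name
        type: llm-grader
        prompt: ./graders/style-review.md   # hypothetical prompt file
```

If no custom prompt or separate panel is needed, the plain-string form alone is the simpler choice.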

### Deterministic Assertions

@@ -232,7 +246,6 @@ Underscore variants (`contains_all`, `is_json`, etc.) are also accepted.
```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assertions:
      - type: is-json
@@ -247,14 +260,12 @@ Use `contains-all` or `contains-any` to check multiple values in a single assert
```yaml
tests:
  - id: required-fields
    criteria: Response mentions all required fields
    input: "Confirm details: name is Alice, email is alice@example.com"
    assertions:
      - type: contains-all
        value: ["Alice", "alice@example.com"]

  - id: greeting-variant
    criteria: Response includes some form of greeting
    input: "Greet the user warmly."
    assertions:
      - type: contains-any
@@ -411,6 +422,23 @@ tests:
value: "4"
```

For contract-style evals where assertion strings express every semantic check,
omit `criteria`:

```yaml
tests:
  - id: verification-learning-capture
    input: |
      Decide what durable repo change should be made after a PR closeout
      revealed reusable verification workflow lessons.
    expected_output: |
      The durable repo change is to update .agents/verification.md with the
      reusable verification workflow lessons.
    assertions:
      - The answer recommends updating .agents/verification.md rather than leaving the learning only in PR comments or private evidence.
      - The answer avoids preserving one-off observations as durable guidance.
```

If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:

```
16 changes: 11 additions & 5 deletions apps/web/src/content/docs/docs/evaluation/eval-files.mdx
@@ -23,13 +23,11 @@ experiment:
  target: default

assertions:
  - name: correctness
    type: llm-grader
    prompt: ./graders/correctness.md
  - Correctly calculates the answer
  - Explains the calculation briefly

tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
@@ -46,6 +44,11 @@ tests:
| `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
| `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |

For historical or repo-state evals, put the checkout under
`workspace.repos[].commit` or `workspace.repos[].base_commit`. A commit SHA in
the prompt or metadata is useful context, but it does not materialize a repo for
the agent to inspect.

### Metadata Fields

You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the `name` field is present:
@@ -82,6 +85,10 @@ The `assertions` field is the canonical way to define suite-level graders. Suite
For semantic or agent-behavior checks, prefer plain assertion strings first;
AgentV treats them as rubric criteria. Use deterministic assertions or code
graders when the expected output is exact or requires programmatic inspection.
If the assertion strings already state the grading contract, omit a duplicate
`criteria` field on each test. Use explicit `type: llm-grader` entries only
when you need a custom prompt, a custom grader target, or a deliberately
separate grader panel.

```yaml
description: API response validation
@@ -95,7 +102,6 @@ assertions:

tests:
  - id: health-check
    criteria: Returns health status
    input: Check API health
```

26 changes: 26 additions & 0 deletions apps/web/src/content/docs/docs/guides/eval-authoring.mdx
@@ -150,3 +150,29 @@ When you don't want to maintain actual diffs, describe the changes inline:
```

This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`.

## Historical Repo State: Pin the Checkout

If a test asks the agent to inspect how a repository looked at a past commit,
declare that checkout in `workspace.repos[]`. Do not rely on prompt prose that
mentions a SHA without materializing the repo.

```yaml
workspace:
  repos:
    - path: ./agentv
      repo: https://github.com/EntityProcess/agentv.git
      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215

tests:
  - id: verification-learning-capture
    input: |
      The eval harness has prepared ./agentv at the historical commit.
      Use that checkout to decide which durable guidance should change.
    expected_output: |
      The durable repo change is to update .agents/verification.md with the
      reusable verification workflow lessons.
    assertions:
      - The answer uses the pinned ./agentv checkout to verify the existing guidance.
      - The answer preserves the historical commit SHA as context.
```
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/tools/validate.mdx
@@ -22,7 +22,7 @@ agentv validate evals/**/*.yaml
## What It Checks

- YAML/JSONL syntax
- Required fields (id, input, criteria)
- Required fields: `id`, `input`, and at least one of `criteria`, `expected_output`, `assertions`, or `turns`
- Grader references (command paths, prompt files)
- Target references match entries in `targets.yaml`
- Rubric structure and field types
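
To make the updated required-fields rule concrete, the sketch below shows how it would apply to three test entries, reading the rule exactly as stated in the list above. The test ids and inputs are hypothetical, and the validator's actual messages are not shown in this diff, so the comments describe only the expected outcome.

```yaml
tests:
  # Expected to pass: `id`, `input`, and `assertions` are present.
  - id: status-json
    input: Return the system status as JSON
    assertions:
      - type: is-json

  # Expected to pass: `criteria` alone satisfies the "at least one of" rule.
  - id: capital-city
    input: What is the capital of France?
    criteria: Names Paris as the capital of France

  # Expected to fail: no `criteria`, `expected_output`, `assertions`, or `turns`.
  - id: missing-grading-contract
    input: Say hello
```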
16 changes: 15 additions & 1 deletion skills-data/agentv-bench/references/eval-yaml-spec.md
@@ -10,7 +10,7 @@ The grader agent uses this to evaluate assertions without the CLI.
- `name` (string, optional) — eval name
- `description` (string, optional) — description
- `execution` (object, optional) — `target`, `model`, etc.
- `workspace` (object, optional) — workspace config (template, hooks)
- `workspace` (object, optional) — workspace config (template, repos, hooks)
- `input` (string | object | Message | Message[], optional) — suite-level input prepended to each test. String/block shorthand expands to a user message.
- `tests` (array, required) — test cases

@@ -24,6 +24,18 @@ The grader agent uses this to evaluate assertions without the CLI.
- `conversation_id` (string, optional) — groups related tests
- `execution` (object, optional) — per-test execution override

If `assertions` already state the grading contract, omit `criteria` instead of
duplicating the same rubric. Prefer plain assertion strings for semantic checks
when the default LLM rubric grader can judge them; use multiple named
`type: llm-grader` blocks only for custom prompts, custom grader targets, or
intentional grader panels. Write `expected_output` as a golden/reference answer,
not as criteria or scoring instructions.

For historical or repo-state evals, materialize the repository under
`workspace.repos[]` and pin `commit` or `base_commit` to the commit under test.
A SHA in prompt prose or metadata is context only; it does not give the agent an
actual checkout.

## 2. Assertion Types and Grading Recipes

### Default grader contract
@@ -35,6 +47,8 @@ When `assertions` is present, the list is explicit: run only the declared
assertions/graders. `expected_output` remains reference data for graders that consume it,
such as `llm-grader`, `code-grader`, or `field-accuracy`; it does not trigger an additional
default `llm-grader`.
When the declared assertion strings fully express the semantic contract, do not
also add a duplicate `criteria` block.

For each assertion type: YAML config fields, grading recipe (exact pseudocode for deterministic types), and PASS/FAIL conditions.

5 changes: 4 additions & 1 deletion skills-data/agentv-eval-review/SKILL.md
@@ -24,7 +24,10 @@ Walk every target eval file and report violations grouped by severity (error > w
- Each entry under `tests` has `id`, `input`, and at least one of `criteria` / `expected_output` / `assertions` (error if missing).
- File-typed inputs (`type: file`) use a leading `/` in their `path` (error if relative).
- Tests have an `assertions` block — flag tests that rely solely on `expected_output` (warning).
- Detect `expected_output` prose patterns like "The agent should…" or "The output is…" (warning — prose belongs in `criteria`, structured matches in `assertions`).
- Flag `criteria` that duplicates assertion strings when `assertions` already express the grading contract (warning — remove the duplicate `criteria`).
- Prefer plain assertion strings over multiple named `type: llm-grader` blocks when the default LLM rubric grader can evaluate the checks (info unless custom prompts or grader targets are present).
- Detect `expected_output` prose patterns like "The agent should..." or "The output is..." (warning — `expected_output` should be a golden/reference answer; scoring rules belong in `assertions` or, for implicit-grader cases, `criteria`).
- For historical or repo-state evals, verify the relevant repo is pinned under `workspace.repos[].commit` or `workspace.repos[].base_commit`; a SHA mentioned only in prompt prose or metadata is not an operational checkout (warning).
- Identical file inputs repeated across multiple tests in the same eval should be hoisted to a top-level `input` (info).
- Eval files in the same directory should share a common `id` prefix (info — flag drift).

50 changes: 44 additions & 6 deletions skills-data/agentv-eval-writer/SKILL.md
@@ -30,6 +30,13 @@ mutation under the parent `experiment:`.

Use `@agentv/sdk` for TypeScript helper imports. Do not use `@agentv/eval` for new evals, examples, scaffolds, or skill guidance; it was a deprecated compatibility package and has been removed from this repository.

## Authoring Checklist

- If `assertions` already state the grading contract, omit `criteria` instead of stating the same rubric twice.
- Prefer plain assertion strings for semantic checks when the default LLM rubric grader can judge them. Use multiple named `type: llm-grader` blocks only for custom prompts, custom grader targets, or intentionally separate grader panels.
- Write `expected_output` as a golden/reference answer the target could have produced. Do not write criteria, scoring instructions, or "the agent should..." rubric prose there.
- For historical or repo-state evals, materialize the repo under `workspace.repos[]` pinned to the commit under test. Mentioning a SHA only in prompt prose is not enough because the agent needs an actual checkout to inspect.

## Evaluation Types

AgentV evaluations measure **execution quality** — whether your agent or skill produces correct output when invoked.
@@ -82,7 +89,6 @@ JSON messages:
```yaml
tests:
  - id: multi-turn-context
    criteria: "Agent remembers prior context"
    input:
      - role: user
        content: "My name is Alice"
@@ -106,7 +112,6 @@ experiment:

tests:
  - id: greeting
    criteria: Friendly greeting
    input: "Say hello"
    expected_output: "Hello! How can I help you?"
    assertions:
@@ -124,7 +129,7 @@ tests:
| Field | Required | Description |
|-------|----------|-------------|
| `id` | yes | Unique identifier |
| `criteria` | yes | What the response should accomplish |
| `criteria` | conditional | What the response should accomplish; required only when no `expected_output` or `assertions` are present |
| `input` | yes | Input to the agent (string/object shorthand or full message array) |
| `expected_output` | no | Gold-standard reference answer (string shorthand or full message array) |
| `assertions` | no | Graders: deterministic checks, rubrics, and LLM/code graders |
@@ -212,7 +217,6 @@ assertions:

tests:
  - id: test-1
    criteria: Returns a useful status payload
    input: Get status
    assertions:
      - type: equals
@@ -224,6 +228,35 @@ Plain strings in `assertions` are rubric criteria and are the preferred shape fo
qualitative agent behavior. Use deterministic assertions (`contains`, `regex`,
`is-json`, `equals`) only for exact machine-verifiable outputs, and code graders
when the check must inspect files, run commands, or validate structured state.
Do not add a separate `criteria` field that just repeats these assertion strings.

For repo-state evals, combine a pinned checkout, a golden answer, and assertion
shorthand:

```yaml
workspace:
  repos:
    - path: ./agentv
      repo: https://github.com/EntityProcess/agentv.git
      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215

tests:
  - id: verification-learning-capture
    input: |
      The eval harness has prepared ./agentv at the commit before the
      verification guidance was added.

      Decide what durable repo change should be made and explain why.
    expected_output: |
      The durable repo change is to update .agents/verification.md with the
      reusable verification workflow lessons. AGENTS.md already routes this
      class of work to .agents/verification.md, so no extra AGENTS.md edit is
      needed unless that routing is missing.
    assertions:
      - The answer recommends updating .agents/verification.md rather than leaving the learning only in PR comments or private evidence.
      - The answer uses the pinned ./agentv checkout to verify the AGENTS.md routing.
      - The answer preserves the historical commit SHA as context.
```

## How `criteria` and `assertions` Interact

@@ -256,7 +289,6 @@ target, declare `llm-grader` explicitly:
```yaml
tests:
  - id: mixed-eval
    criteria: Response is helpful and mentions the fix
    input: "Debug this function..."
    assertions:
      - Explains why the bug happens
@@ -277,6 +309,10 @@ tests:
# Warning: criteria is defined but no grader in assertions will evaluate it.
```

If plain assertion strings fully express the semantic contract, leave `criteria`
out. Keep `criteria` for the implicit-grader path or for non-duplicative context
that a declared grader actually needs.

## Required Gates

Any grader can be marked `required` to enforce a minimum score:
@@ -319,7 +355,6 @@ workspace:
tests:
  - id: case-1
    input: Fix the bug
    criteria: Bug is fixed
    metadata:
      source_repo: sympy/sympy
      source_commit: "abc123"
@@ -330,6 +365,9 @@ tests:
**Commands receive stdin JSON:** `{workspace_path, test_id, eval_run_id, case_input, case_metadata}`
**Setup failure:** aborts case. **Teardown failure:** non-fatal (warning).
For SWE-bench-style evals, keep operational checkout state under `workspace.repos[].base_commit`; treat `metadata.source_commit` as informational only.
For historical repo-state evals, pin `workspace.repos[].commit` or
`workspace.repos[].base_commit` to the commit under test. A SHA in the prompt or
metadata without a matching workspace repo pin is not an operational checkout.
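
A rough sketch of the pairing described above: the operational pin lives under `workspace.repos[]`, and the metadata copy is informational. It assumes `base_commit` sits next to `path` and `repo` in a repo entry, the same way `commit` does in the other examples in this PR. The checkout path and the placeholder SHA are illustrative, not values from the source.

```yaml
workspace:
  repos:
    - path: ./repo                                  # assumed checkout location
      repo: https://github.com/sympy/sympy.git
      base_commit: "<full-commit-sha>"              # placeholder; this is the operational pin

tests:
  - id: case-1
    input: Fix the bug
    metadata:
      source_repo: sympy/sympy
      source_commit: "<full-commit-sha>"            # informational only; does not create a checkout
```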

### Repository Lifecycle
