43 changes: 32 additions & 11 deletions apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ Benchmark suites usually need more than a prompt and a score. They carry source
pins, task patches, generated dataset rows, oracle data, setup scripts, and
verification commands. AgentV represents that with existing primitives:

- Put runtime behavior in `workspace`, `execution`, `input`, `expected_output`,
- Put runtime behavior in `workspace`, `experiment`, `input`, `expected_output`,
and `assertions`.
- Put provenance and classification in per-case `metadata`.
- Put bulky per-case authoring inputs in optional case directories and supporting files.
Expand All @@ -29,7 +29,7 @@ Use this split when deciding where a benchmark key belongs:
| `workspace.template` | Yes | Copies a workspace template into the run workspace. |
| `workspace.hooks` | Yes | Runs lifecycle commands with workspace and case context on stdin. |
| `workspace.isolation`, `workspace.mode`, `workspace.path` | Yes | Controls workspace reuse and materialization. |
| `execution` | Yes | Selects targets, thresholds, dependencies, and default grader behavior. |
| `experiment` | Yes | Selects targets, thresholds, repeat policy, budgets, workers, and default grader behavior. |
| `input`, `input_files`, `expected_output` | Yes | Builds the target prompt and passive reference answer. |
| `assertions` | Yes | Runs deterministic, LLM, composite, or code graders. |
| Top-level `name`, `version`, `tags`, `license`, `requires` | Informational | Identifies and categorizes the suite. |
Expand Down Expand Up @@ -162,7 +162,7 @@ workspace:
before_each:
command: ["python", "./scripts/apply-case-fixtures.py"]

execution:
experiment:
targets: [codex, claude]

assertions:
Expand All @@ -182,20 +182,41 @@ the imported results, and link Opik traces when Harbor uploads them.
# Proposed runner boundary, not a current AgentV task schema.
name: swebench-verified-codex

execution:
runner: harbor
harbor:
dataset: swebench-verified
agent: codex
model: openai/gpt-5-mini
opik:
enabled: true
experiment:
target: codex
model: openai/gpt-5-mini
runner:
type: harbor
options:
opik:
enabled: true
```

Do not translate Harbor `task.toml`, verifier packaging, or suite-specific
Docker/Compose adapter fields into AgentV core eval schema. If the benchmark's
runtime contract is already owned by Harbor, keep those details in Harbor and
let AgentV consume the job metadata, rewards, artifacts, and trace links.
Do not add a generic top-level `source` field just to identify Harbor. If a
future Harbor adapter needs suite selection, keep that selector narrow and
adapter-owned instead of making it the AgentV workspace model.

## Eval Composition

When one eval references another eval, preserve the task/runtime split:

- The parent runnable eval owns runtime `experiment:` for the run.
- Child `experiment:` blocks are ignored by `type: suite` composition. There is
no fallback to the child `experiment:` when the parent has no `experiment:`.
- Child `workspace` setup is preserved for `type: suite` imports. Parent
workspace applies to raw cases owned by the parent file, not to imported suite
tests.
- A tests-only import can drop child workspace context only when the import mode
says so explicitly.
- Workspace path collisions or incompatible isolation settings should fail
loudly if a future explicit remap mode is added.

That rule keeps imported benchmark cases attached to their setup while still
letting a parent eval compare targets, repeat policy, and gates consistently.
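
The ownership rules above can be sketched in a few lines of Python. This is an illustration only, not AgentV's loader: `EvalFile`, `resolve_cases`, and the record shape are hypothetical names introduced here, and nested imports are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalFile:
    """Hypothetical in-memory view of one eval YAML file (not AgentV's real model)."""

    name: str
    experiment: Optional[dict] = None
    workspace: Optional[dict] = None
    raw_cases: list = field(default_factory=list)      # case ids owned by this file
    suite_imports: list = field(default_factory=list)  # child EvalFile objects (type: suite)


def resolve_cases(parent: EvalFile) -> list:
    """Pair every case with the workspace and experiment that own it."""
    resolved = []
    # Raw cases owned by the parent file use the parent workspace.
    for case_id in parent.raw_cases:
        resolved.append(
            {"case": case_id, "workspace": parent.workspace, "experiment": parent.experiment}
        )
    # Imported suite cases keep the child workspace. The child experiment is
    # ignored, and there is no fallback to it when the parent has none.
    for child in parent.suite_imports:
        for case_id in child.raw_cases:
            resolved.append(
                {
                    "case": f"{child.name}/{case_id}",
                    "workspace": child.workspace,
                    "experiment": parent.experiment,
                }
            )
    return resolved
```

In this sketch a wrapper with no `experiment` yields `None` for every imported case rather than borrowing the child block, which mirrors the "no fallback" rule.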

## Finance-Style Generated Dataset

Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,9 @@ Proposed
AgentV now has native workspace repository acquisition for custom evals, CI
gates, target comparisons, pooled workspaces, hooks, and Docker workspace cases.
That should remain generic infrastructure: `workspace.repos[].commit` is the
canonical checkout pin, and `workspace.repos[].base_commit` is only a
SWE-Bench-friendly alias for the same value.
canonical checkout pin. SWE-Bench `base_commit` is upstream/import vocabulary
that adapters may translate into `commit`; it should not become the canonical
AgentV workspace field.
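
As a rough illustration of that translation, an import adapter could map an upstream SWE-Bench row onto a `workspace.repos[]` style entry. The function name and the `url` key are assumptions made for this sketch; only `commit` is named by this document, and `repo` and `base_commit` are the upstream SWE-Bench column names.

```python
def swebench_row_to_repo(row: dict) -> dict:
    """Translate an upstream SWE-Bench row into a workspace.repos[] style entry.

    Hypothetical adapter helper: the "url" key is an assumed field name,
    while "commit" is the canonical checkout pin described above.
    """
    return {
        "url": f"https://github.com/{row['repo']}.git",  # assumed key name
        "commit": row["base_commit"],  # upstream vocabulary becomes `commit`
    }
```

The upstream `base_commit` key does not survive into the returned entry, so downstream code only ever sees `commit`.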

Harbor owns benchmark-grade execution for standard suites such as SWE-Bench
Verified, Multi-SWE-Bench, Terminal-Bench, and suites with Harbor-specific
Expand All @@ -39,101 +40,79 @@ Harbor should own:
- Harbor `task.toml` files and Harbor YAML config;
- Opik trace upload through Harbor when enabled.

## Alignment with experiment separation
## Alignment with inline experiment runtime

The 2026-06-23 experiment/eval separation decision makes runtime binding an
experiment concern. Harbor execution should follow the same split:
The 2026-06-26 inline experiment decision keeps runtime binding inside the
single runnable `eval.yaml` artifact. Harbor execution should follow the same
split:

- AgentV eval YAML remains the authoring or selection surface for what benchmark
suite is being evaluated.
- AgentV experiment YAML selects or pins the Harbor runner, candidate
- The inline `experiment:` block selects or pins the Harbor runner, candidate
agent/model, run policy, and other runtime binding.
- Harbor-authored YAML remains Harbor's own config surface when the standard
suite needs Harbor-specific task packaging or verifier settings.

This means the examples below show the desired logical fields, but new
runtime fields should be placed on an experiment unless they are genuinely part
of the benchmark suite identity. Do not put candidate agent/model binding in the
eval file for new AgentV-native examples.
runtime fields should be placed under `experiment:` unless they are genuinely
part of the benchmark suite identity. Do not put candidate agent/model binding
under `source` for new AgentV-native examples.

## Minimal future config surface

An AgentV eval suite can select the benchmark source without copying Harbor's
task schema or claiming to be the runtime runner:
An AgentV eval suite should not gain a generic top-level `source` field just to
select Harbor. ADR 0009 keeps benchmark-shaped evals on existing primitives:
workspace setup belongs in `workspace`, runtime binding belongs in
`experiment`, and imported benchmark provenance belongs in metadata, sidecars,
or adapter manifests.

```yaml
name: swebench-verified

source:
type: harbor
dataset: swebench-verified
```

The corresponding experiment selects how that suite runs:
If a future Harbor adapter needs first-class selection in AgentV YAML, it should
be designed as a narrow runner/import selector after real usage, not as a broad
benchmark source schema. The corresponding inline experiment block still selects
how that suite runs:

```yaml
name: swebench-verified-codex
target: codex-gpt5-mini
evals: evals/swebench-verified.eval.yaml
runner:
type: harbor
options:
opik:
enabled: true
```

For a Harbor-authored YAML file, use `config` instead of `dataset`:

```yaml
source:
type: harbor
config: ./harbor/swebench-verified.yaml
experiment:
target: codex
model: openai/gpt-5-mini
runner:
type: harbor
options:
opik:
enabled: true
```

The first implementation should accept exactly one Harbor source selector:
`dataset` for a known Harbor dataset id, or `config` for an existing Harbor YAML
file. There should be no precedence rule between them. If both are set, fail
validation and ask the user to choose one.

Do not combine Harbor suite selection with candidate binding in the eval file:
Do not combine Harbor suite selection with candidate binding in an invented
source block:

```yaml
# Avoid in eval.yaml
execution:
source:
runner: harbor
harbor:
dataset: swebench-verified
agent: codex
model: openai/gpt-5-mini
dataset: swebench-verified
agent: codex
model: openai/gpt-5-mini
```

Split that shape across the suite and experiment instead:
Keep runtime binding in `experiment:` instead:

```yaml
# evals/swebench-verified.eval.yaml
name: swebench-verified
source:
type: harbor
dataset: swebench-verified
```

```yaml
# experiments/swebench-verified-codex.yaml
name: swebench-verified-codex
target: codex
model: openai/gpt-5-mini
evals: evals/swebench-verified.eval.yaml
runner:
type: harbor
experiment:
target: codex
model: openai/gpt-5-mini
runner:
type: harbor
```

Keep Harbor suite source selection under `source` in the eval suite. Keep
experiment-side runner selection under `runner.type`, with runner knobs under
`runner.options`. The eval suite answers "where do these cases come from?"; the
experiment answers "how is this run executed?" Do not use `execution.runner` in
new eval-suite examples because that name collides with the experiment runner.
Do not repeat the runner discriminator as `runner.harbor.options`; `type:
harbor` already provides that namespace.
Keep runtime runner selection under `experiment.runner.type`, with runner knobs
under `experiment.runner.options`. Do not use `execution.runner` in new
eval-suite examples because top-level `execution:` is only a legacy alias for
old eval files. Do not repeat the runner discriminator as
`runner.harbor.options`; `type: harbor` already provides that namespace.
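
A lint pass for these placement rules might look like the following. It is a sketch under stated assumptions: `lint_runner_shape` and its messages are hypothetical, and it takes an already-parsed eval document as a plain dict rather than using any real AgentV validator.

```python
def lint_runner_shape(eval_doc: dict) -> list[str]:
    """Flag runner placements that the guidance above discourages (hypothetical helper)."""
    problems = []

    # Top-level `execution:` is only a legacy alias, so flag runner selection there.
    execution = eval_doc.get("execution")
    if isinstance(execution, dict) and "runner" in execution:
        problems.append("use experiment.runner.type instead of execution.runner")

    runner = (eval_doc.get("experiment") or {}).get("runner")
    if runner is None:
        return problems
    if not isinstance(runner, dict) or "type" not in runner:
        problems.append("experiment.runner must be a mapping with a type")
        return problems

    # `type: harbor` already namespaces the options, so a nested
    # `runner.harbor` key repeats the discriminator.
    runner_type = runner["type"]
    if isinstance(runner_type, str) and runner_type in runner:
        problems.append(
            f"put runner knobs under experiment.runner.options, not runner.{runner_type}"
        )
    return problems
```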

Do not add top-level AgentV fields for Harbor task packaging, verifier images,
task patches, or Docker/Compose adapter settings. If a Harbor option becomes too
Expand All @@ -153,8 +132,9 @@ agentv eval evals/native.eval.yaml --target codex
```

Harbor-backed evals should use the same top-level entrypoint. If no explicit
experiment runner is configured, AgentV may infer Harbor execution from
`source.type: harbor`:
experiment runner is configured, a future adapter may infer Harbor execution
from an adapter-owned manifest or CLI flag, but this ADR does not add that
schema:

```bash
agentv eval evals/swebench-harbor.eval.yaml
Expand All @@ -176,9 +156,9 @@ agentv results import harbor --job <harbor-job-id>
```

Do not overload native `--target` semantics in the first Harbor runner slice.
Harbor `agent`, `model`, and matrix behavior should come from the experiment or
the referenced Harbor YAML until repeated usage proves a shared AgentV flag is
needed.
Harbor `agent`, `model`, and matrix behavior should come from inline
`experiment:` runtime fields or the referenced Harbor YAML until repeated usage
proves a shared AgentV flag is needed.

## Unsupported fields and non-goals

Expand All @@ -191,6 +171,8 @@ The Harbor runner mode should not add or interpret:
`base_commit` as Harbor runner inputs;
- generic `extra_args` or arbitrary pass-through maps in the initial AgentV
surface;
- generic top-level `source` as a replacement for AgentV `workspace` or
metadata conventions.

These fields remain valid in native AgentV evals when authors compose their own
workspace, hooks, and graders. They are non-goals only for the Harbor-backed
Expand All @@ -199,9 +181,9 @@ standard-suite path.
## Implementation sequencing

1. Document the native-vs-Harbor boundary and commit alias rules.
2. Add schema validation for eval-suite `source.type: harbor` and exactly one of
`source.dataset` or `source.config`, plus experiment `runner.type` and
`runner.options`, with no changes to native workspace acquisition.
2. Add a narrow Harbor runner/import selector only after repeated usage proves
it is needed; keep inline `experiment.runner.type` and
`experiment.runner.options`, with no changes to native workspace acquisition.
3. Add a Harbor launch adapter that records job identity and status.
4. Add a Harbor result importer that maps rewards, exceptions, timings,
artifacts, and Opik trace URLs into AgentV run bundles.
68 changes: 65 additions & 3 deletions docs/adr/0006-separate-experiments-from-eval-definitions.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@ The final design keeps the product boundary smaller:
- `experiment:` is an inline run-time block inside `eval.yaml`.
- `tests:` is the composition, import, and selection surface.
- result bundles are written under `.agentv/results/<eval-name>/<timestamp>/`.
- A directory named `experiments/` may be used as a user-owned repo convention
for wrapper eval YAML files, but it does not create a separate experiment
artifact type or schema-significant path.

This keeps AgentV repo-native and zero-infra by default, avoids a new public
artifact type, and still lets wrapper evals run multiple imported suites with a
Expand All @@ -45,7 +48,10 @@ Do not introduce or document:
- committed experiment files as the canonical authoring path

The only runnable authoring artifact is `eval.yaml` or another `*.eval.yaml`
file. Runtime controls live in an inline `experiment:` block:
file. A project may place wrapper eval files under an `experiments/` directory
when their main job is to bind runtime policy over reusable suites, but those
files are still ordinary eval YAML files. AgentV must not infer behavior from
that directory name. Runtime controls live in an inline `experiment:` block:

```yaml
name: cargowise-sql-migration-codex
Expand Down Expand Up @@ -103,6 +109,28 @@ The old experiment runtime fields are ported into the parent eval file:
Suite or case workspace fields remain task-owned when they define what is being
evaluated.

## Contract Layers

Parent-versus-child is not the main composition rule. Contract ownership is:

| Layer | AgentV fields | Owner in `type: suite` composition |
| --- | --- | --- |
| Task data | `tests`, case `metadata`, `expected_output` | Imported child suite |
| Task prompt | `input`, `input_files`, shared prompt defaults | Imported child suite |
| Task environment | `workspace`, `workspace.repos[]`, templates, workspace hooks | Imported child suite |
| Scoring | `assertions`, graders, expected references | Imported child suite |
| Run policy | `experiment`, CLI target flags, workers, repeat, gates, budget | Parent wrapper eval or CLI |
| Target runtime | selected target config and `targets[].hooks` | Selected target |

`workspace` can influence what an agent perceives through tools, but it is not
prompt input. `input` is what the agent is told; `workspace` is what the agent
can act on; `assertions` are how AgentV judges the result; `experiment` is how
the run is bound, repeated, compared, and gated.

This framing removes the apparent "child wins except experiment" exception.
Child suites own task contracts. Parent wrapper evals and CLI flags own run
contracts.

## Lifecycle Ownership

`experiment:` owns evaluation policy, not lifecycle mutation. Commands that
Expand Down Expand Up @@ -219,6 +247,25 @@ fields. Explicit override syntax can be considered later if a concrete use case
needs it, but the default composition model must not merge task contracts in a
surprising way.

If a parent eval defines `workspace` and imports child eval suites with
`type: suite`, the parent workspace applies only to raw cases owned by the
parent file. Imported suite tests keep their child suite workspace. This is a
valid mixed-case pattern when the parent owns raw cases, but it is usually a
DX smell when every test is a `type: suite` import. AgentV should warn or lint
that shape rather than silently implying a parent workspace override.

If a parent eval has no `experiment:` and imports child suites that do have
`experiment:` blocks, child runtime still does not fall back into the parent
run. AgentV should warn because authors often expect the child runtime to be
used. The correct choices are to run the child suite directly, add a parent
`experiment:` block, or pass CLI runtime flags.

Wrapper evals that import multiple suites with distinct shared workspace
contracts should fail fast or require per-test isolation, separate runs, or an
explicit future composition mode. Shared workspace setup is safe when one suite
owns the task contract; it is not a place for implicit parent-child or
child-child workspace merging.
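
The first two diagnostics described above could be approximated as a small lint function. This is a sketch, not AgentV's implementation: `lint_wrapper_eval`, the warning text, and the assumption that each `tests` entry is a mapping carrying `type: suite` are all illustrative.

```python
def lint_wrapper_eval(parent: dict, children: list[dict]) -> list[str]:
    """Return warnings for wrapper shapes that tend to surprise authors (hypothetical helper)."""
    warnings = []
    tests = parent.get("tests") or []

    # A parent workspace only applies to raw cases owned by the parent, so it
    # has no effect when every test is a `type: suite` import.
    all_suite_imports = bool(tests) and all(
        isinstance(test, dict) and test.get("type") == "suite" for test in tests
    )
    if parent.get("workspace") and all_suite_imports:
        warnings.append("parent workspace does not apply to imported suite tests")

    # Child experiment blocks never fall back into the parent run.
    if not parent.get("experiment") and any(child.get("experiment") for child in children):
        warnings.append(
            "child experiment blocks are ignored; run the child directly, "
            "add a parent experiment, or pass CLI runtime flags"
        )
    return warnings
```

The fail-fast check for multiple suites with distinct shared workspace contracts is left out because its exact conditions are not specified here.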

## Runtime Overrides

The parent `experiment:` block is the default runtime policy for the whole eval.
Expand Down Expand Up @@ -330,6 +377,13 @@ remain inspectable without reading every manifest row. Test artifacts from tests
owned directly by the wrapper eval can still live directly under `<test-id>`.
All cases should also retain source suite metadata in manifests and index rows.

The result namespace remains `experiment` in artifacts and Dashboard. AgentV
should not introduce a separate authored `run_group` field. For better DX,
Dashboard and reports may derive display-only runtime source labels such as
`inline experiment`, `CLI`, `defaults`, or `mixed`, and may show the top-level
eval file plus imported suites. Those labels are explanations over existing
primitives, not new configuration surface.
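
One way a report could derive such a label is sketched below. The derivation rule is an assumption, since this ADR names the labels but not how to compute them: the function treats the label as a summary of which sources supplied any runtime field, and both the function name and its inputs are hypothetical.

```python
def runtime_source_label(inline_fields: set[str], cli_fields: set[str]) -> str:
    """Summarize where runtime fields came from as a display-only label.

    Assumed rule: fields from both sources give "mixed", fields from one
    source give that source's label, and no explicit fields give "defaults".
    """
    if inline_fields and cli_fields:
        return "mixed"
    if cli_fields:
        return "CLI"
    if inline_fields:
        return "inline experiment"
    return "defaults"
```

Because the label is computed from existing inputs at display time, it adds no authored field to the eval file.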

## Consequences

Positive:
Expand All @@ -350,14 +404,22 @@ Negative:
evals.
- Explicit task-context override syntax is deferred, so authors who need
overrides must create a new suite or wait for a focused override design.
- Wrapper evals need diagnostics so authors understand that parent workspace
does not override imported suite workspaces and child experiment blocks are
ignored.

## Non-Goals

- Do not add separate `experiment.yaml` files or an `experiments/` convention.
- Do not add separate `experiment.yaml` files.
- Do not make `experiments/` a schema-significant directory or separate
artifact type; it may only be a repo layout convention for ordinary wrapper
eval YAML files.
- Do not add config pointers to external experiment files.
- Do not present committed experiment files as canonical docs examples.
- Do not present committed non-eval experiment files as canonical docs examples.
- Do not make child suite runtime blocks participate in parent wrapper runtime
selection.
- Do not add an authored `run_group` field.
- Do not implicitly merge parent and child workspaces for `type: suite` imports.
- Do not silently override imported suite task fields from parent suite fields.
- Do not encode source suite membership by adding redundant default result path
segments.