44 changes: 31 additions & 13 deletions docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -31,7 +31,9 @@ The final design keeps the product boundary smaller:
- `eval.yaml` is the only runnable authoring artifact.
- `experiment:` is an inline run-time block inside `eval.yaml`.
- `tests:` is the composition, import, and selection surface.
- result bundles are written under `.agentv/results/<eval-name>/<timestamp>/`.
- result invocations are written under
`.agentv/results/<experiment>/<timestamp>/`, with target/variant bundles below
the timestamp.
- A directory named `experiments/` may be used as a user-owned repo convention
for wrapper eval YAML files, but it does not create a separate experiment
artifact type or schema-significant path.
@@ -371,28 +373,44 @@ This is the motivating distinction:

## Result Layout

The canonical writer path is:
ADR 0009 is the source of truth for result experiment bucket precedence,
target/variant bundle layout, row identity, and sidecar path allocation. The
canonical invocation directory is:

```text
.agentv/results/<eval-name>/<timestamp>/...
.agentv/results/<experiment>/<timestamp>/...
```

There is no `.agentv/results/runs` segment in canonical writer output. There is
also no default nested suite segment when the result group is already the same
eval suite being run directly.
also no schema-significant suite or imported-suite directory segment.

If a wrapper eval imports another suite with `type: suite`, test artifacts from
that imported suite are nested under the imported suite identity:
Within the timestamp, new writers fan out into a target and optional variant
bundle. Each target/variant bundle has its own `index.jsonl` and `summary.json`:

```text
.agentv/results/<wrapper-eval-name>/<timestamp>/<imported-suite-name>/<test-id>/...
.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
index.jsonl
summary.json
<row_id>/run-1/
<row_id>/run-2/
```

The suite segment is required for imported suites because wrapper evals can
compose many suites with overlapping test IDs, and the directory tree should
remain inspectable without reading every manifest row. Test artifacts from tests
owned directly by the wrapper eval can still live directly under `<test-id>`.
All cases should also retain source suite metadata in manifests and index rows.
The target/variant folder split is for storage isolation and manual browsing
only. Dashboard, import, rerun, comparison, and export readers discover nested
`index.jsonl` files, then use bundle summary and row metadata for `target`,
`variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy timestamp-level
bundles with `index.jsonl` directly under
`.agentv/results/<experiment>/<timestamp>/` remain readable.

Inside each target/variant bundle, `index.jsonl` is authoritative for row
identity and all bundle-relative sidecar paths. `row_id` directories are stable,
filesystem-safe allocations such as `<safe_test_id>--<short_hash>`, and repeated
runs live under `run-1`, `run-2`, and so on. The hash input includes the source
eval identity or `eval_path`, suite label, `test_id`, target, and variant. This
keeps same-`test_id` rows from different suites, duplicate suite labels, targets,
and variants from overwriting each other without making path hierarchy a
semantic contract. There is no required `rows/` parent directory. Source suite
metadata still belongs in manifests and index rows.

The result namespace remains `experiment` in artifacts and Dashboard. AgentV
should not introduce a separate authored `run_group` field. For better DX,
124 changes: 106 additions & 18 deletions docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -14,31 +14,64 @@ composition.
## Context

AgentV needs one simple result identity contract that works for direct eval
runs, imported evals, repeated attempts, Dashboard inspection, and downstream
tools that consume portable run bundles.
runs, imported evals, repeated runs, Dashboard inspection, and downstream tools
that consume portable run bundles.

The previous same-week direction kept `eval.yaml` as the authored experiment
spec, but it still let result buckets and per-case paths be inferred from eval
or suite names. That creates unstable routing when a single CLI invocation runs
multiple eval files, imports suites with overlapping case IDs, or changes
display metadata without changing the task under evaluation.

Follow-up dogfood in bead `av-770` found a concrete bug in that direction:
multi-target AgentV runs stored different targets for the same `test_id` under
the same `<case>/run-1` sidecar directory. The second target overwrote the
first target's output, grading, timing, metrics, and case summary artifacts.
Related research in beads `av-74h` and `av-e49` compared Vercel `agent-eval`,
Vercel `next-evals-oss`, and Margin Evals. Those systems confirm that
frameworks can either encode model/variant in experiment names or keep run
manifests as the truth, but none requires AgentV to derive artifact paths from
suite names.

The final contract keeps authoring and storage separate:

- `eval.yaml` remains the authored experiment spec.
- a CLI invocation produces one timestamped run bundle;
- a CLI invocation produces one timestamped invocation directory;
- each target and optional variant in that invocation gets an isolated result
bundle under the timestamp;
- per-row source identity is stored in `index.jsonl`;
- `suite` and `name` remain display metadata only;
- path discovery comes from the run manifest, not from folder conventions.
- per-row sidecar directories are stable storage allocations, not semantic
routing keys.

## Decision

One AgentV CLI invocation writes one run bundle under:
One AgentV CLI invocation writes one timestamped invocation directory under:

```text
.agentv/results/<experiment>/<timestamp>/
```

`experiment` remains the campaign namespace. The timestamp is the
invocation/batch folder for that CLI run. Within the timestamp, new writers fan
out into one bundle per target and optional variant:

```text
.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
index.jsonl
summary.json
<row_id>/run-1/
<row_id>/run-2/
```

The `<target>/<variant?>` folder split is only storage isolation and manual
browsing structure. It is not the semantic source for target or variant. Readers
discover nested `index.jsonl` files, then use loaded summary and row metadata for
`target`, `variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy
timestamp-level bundles that keep `index.jsonl` and `summary.json` directly
under `.agentv/results/<experiment>/<timestamp>/` remain readable.
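
The reader behaviour described above could be sketched roughly as follows. This is a minimal Python illustration, not AgentV's implementation: the helper names `discover_bundles` and `bundle_semantics` are hypothetical, and the sketch assumes every line of `index.jsonl` is a JSON object that carries its own `target` and optional `variant` fields.

```python
import json
from pathlib import Path


def discover_bundles(invocation_dir):
    """Yield (bundle_dir, rows) for each index.jsonl under a timestamp dir.

    Hypothetical helper. The recursive search also finds a legacy bundle
    whose index.jsonl sits directly under the timestamp directory.
    """
    root = Path(invocation_dir)
    for index_path in sorted(root.rglob("index.jsonl")):
        rows = [
            json.loads(line)
            for line in index_path.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        yield index_path.parent, rows


def bundle_semantics(rows):
    """Collect (target, variant) pairs from row metadata only.

    Folder names are deliberately ignored: they are storage layout,
    not the semantic source for target or variant.
    """
    return {(row.get("target"), row.get("variant")) for row in rows}
```

Because the folder names never enter `bundle_semantics`, renaming or flattening the target/variant directories would not change what a reader reports, which is the property the ADR asks for.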

The result experiment bucket is selected in this order:

1. the explicit CLI `--experiment` value;
@@ -55,42 +88,96 @@ suite names, numbers of input eval files, or multi-eval wrapper shapes.
Each row in `index.jsonl` is identified by:

```text
eval_path + test_id + target
eval_path + test_id + target + variant
```

`eval_path` is the source eval file path relative to the repo root or run
source root. Dashboard and other readers should display this value as `Eval`.
They should also display `test_id` and `target` so users can distinguish rows
with overlapping test IDs.
They should also display `test_id`, `target`, and `variant` when present so
users can distinguish rows with overlapping test IDs.
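
As an illustration of the four-part identity, a reader might key rows like this. The `RowKey` type and `row_key` function are hypothetical names for this sketch, and it assumes index rows are already parsed into dictionaries with the field names used in this ADR.

```python
from typing import NamedTuple, Optional


class RowKey(NamedTuple):
    """Identity of one index.jsonl row: eval_path + test_id + target + variant."""

    eval_path: str
    test_id: str
    target: str
    variant: Optional[str] = None


def row_key(row):
    """Build the identity key from a parsed row; variant may be absent.

    `suite` and `name` are intentionally not read here, since they are
    display metadata rather than identity.
    """
    return RowKey(
        eval_path=row["eval_path"],
        test_id=row["test_id"],
        target=row["target"],
        variant=row.get("variant"),
    )
```

Two rows that share a `test_id` but come from different eval files get different keys, while changing only `suite` or `name` leaves the key unchanged.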

`suite` and `name` are display metadata. They may help humans group or label
results, but they must not drive storage, routing, Dashboard detail selection,
rerun lookup, import identity, or artifact discovery.
results, and `suite` may participate in opaque row-id collision avoidance, but
they must not drive visible storage hierarchy, semantic routing, Dashboard
detail selection, rerun lookup, import identity, or artifact discovery.

`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
`index.jsonl` is authoritative for all bundle-relative artifact paths. Per-row
directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
`summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
`targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
Consumers must use these fields instead of reconstructing paths from
`suite`, `name`, `test_id`, or `target`.
`suite`, `name`, `test_id`, `target`, `variant`, or target/variant folder names.
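
A consumer following this rule might resolve sidecars along these lines. This is a sketch under stated assumptions: `resolve_sidecar` is a hypothetical helper, and it assumes the manifest fields hold path strings relative to the bundle directory that contains `index.jsonl`.

```python
from pathlib import Path


def resolve_sidecar(bundle_dir, row, field):
    """Return the path recorded in the manifest for `field`, or None.

    The path is taken verbatim from the index row. Nothing is rebuilt
    from suite, name, test_id, target, variant, or folder names.
    """
    relative = row.get(field)
    if relative is None:
        return None
    return Path(bundle_dir) / relative
```

For example, `resolve_sidecar(bundle_dir, row, "grading_path")` yields whatever location the writer recorded, so a writer remains free to change its directory allocation without breaking readers.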

`result_dir` is an opaque bundle-local allocation. For newly written artifacts,
the preferred allocation is a deterministic row directory inside the
target/variant bundle:

```text
.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
index.jsonl
summary.json
<row_id>/run-1/
<row_id>/run-2/
```

There is no required `rows/` parent directory. `row_id` should be stable,
filesystem-safe, compact, and readable enough for humans to scan, for example:

```text
<safe_test_id>--<short_hash>
```

`result_dir` is an opaque run-local allocation. It should stay readable when
that does not compromise uniqueness, but implementations may suffix or allocate
otherwise to avoid collisions. Its value is not the public identity of the row.
The visible `test_id` prefix is only a convenience. The hash input must include
the collision-prone row fields available at write time: `eval_path` or source
eval identity, `suite` label, `test_id`, `target`, and `variant`. `eval_path`
or equivalent source identity is what prevents duplicate suite names from
colliding; the suite label alone is never a uniqueness boundary. If future row
identity gains another axis, that axis must be included in the hash before it
can affect sidecar allocation.
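
One way to produce a `<safe_test_id>--<short_hash>` value is sketched below. The ADR does not prescribe a hash function, digest length, or sanitisation rule, so SHA-256, the 8-character prefix, the character whitelist, and the name `allocate_row_id` are all assumptions made for illustration.

```python
import hashlib
import json
import re


def allocate_row_id(eval_path, suite, test_id, target, variant=None, hash_len=8):
    """Return a stable, filesystem-safe row id (hypothetical allocation).

    The visible prefix is a sanitised test_id for human scanning. The
    suffix hashes every collision-prone field named in the ADR.
    """
    # Assumed sanitisation: keep a conservative character set.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", test_id).strip("-.") or "row"
    # JSON-encode the fields so that field boundaries stay unambiguous.
    payload = json.dumps([eval_path, suite or "", test_id, target, variant or ""])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:hash_len]
    return f"{safe}--{digest}"
```

Because `eval_path` is part of the hashed payload, two suites with the same label and the same `test_id` still receive different row directories. A short digest trades a small collision probability for readability; an implementation could lengthen it if that trade-off proves too tight.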

This row-id allocation is intentionally simpler than conditional path
disambiguation. It avoids special cases for same `test_id` across suites,
duplicate suite labels, multi-target runs, and target variants while keeping the
multi-target CLI under one timestamped invocation directory. Existing run
bundles remain readable because `index.jsonl` already records explicit artifact
paths; any consumer that infers `<case>/run-1` paths or semantic target
information from folders instead of following `index.jsonl` is depending on an
implementation detail and should be fixed.

Reference alternatives considered:

- Vercel `agent-eval` expands model arrays into experiment paths such as
`<config>/<model>/<timestamp>/<case>/run-1`. This works for
model-as-experiment publication but fragments one multi-target invocation and
lets provider names with slashes become path hierarchy.
- Vercel `next-evals-oss` uses one experiment file per model and `--agents-md`
variant, then pairs variants during export. AgentV should allow that style by
experiment naming for published baselines, but AgentV remains a superset of
Vercel-style naming and must not hard-code Next or Vercel semantics for
ordinary multi-target runs.
- Margin Evals writes one output run directory with result manifests and
instance artifacts, without an AgentV-style experiment bucket. That validates
manifest-first storage, but dropping AgentV's experiment bucket is a larger
semantic change than this bug fix needs.

## Consequences

- A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
produces one timestamped bundle unless the user explicitly runs separate CLI
commands.
produces one timestamped invocation directory unless the user explicitly runs
separate CLI commands.
- The default no-config path is stable:
`.agentv/results/default/<timestamp>/`.
- Renaming a suite or display name does not move prior results or change
Dashboard routing identity.
- Multiple eval files can share the same `test_id` and suite display name as
long as their `eval_path` values differ.
- Import, rerun, Dashboard, comparison, and export tools can load a run from
`index.jsonl` without needing source checkout conventions.
- Import, rerun, Dashboard, comparison, and export tools can load runs by
discovering nested `index.jsonl` manifests without needing source checkout
conventions or folder-name semantics.
- Multi-target and variant runs do not need to become multiple experiments or
separate timestamped invocations just to avoid sidecar collisions.
- New sidecar paths may not resemble the case hierarchy, which is acceptable
because `index.jsonl` is the contract for discovery and display.

## Non-Goals

@@ -99,4 +186,5 @@ otherwise to avoid collisions. Its value is not the public identity of the row.
- Hashing eval paths into default experiment names.
- Creating automatic `multi-eval` experiment names.
- Making `result_dir` a semantic folder contract.
- Adding a `rows/` directory segment without a concrete implementation need.
- Removing compatibility readers for older run bundles in this ADR.