Merged
8 changes: 6 additions & 2 deletions CONCEPTS.md
@@ -16,9 +16,13 @@ Shared domain vocabulary for this project — entities, named processes, and sta

**Experiment** — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.

**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`.
**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.

**Artifact sidecar** — A file beside or below a test-case artifact directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
**Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.

**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.

**Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.

## Evaluation Reliability

32 changes: 23 additions & 9 deletions apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,13 @@ sidebar:
agentv eval evals/my-eval.yaml
```

Results are written to `.agentv/results/<experiment>/<timestamp>/index.jsonl`. When no experiment is defined, AgentV uses `.agentv/results/default/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
Results are written to `.agentv/results/<experiment>/<timestamp>/index.jsonl`.
AgentV picks the experiment bucket from `--experiment`, then
`eval.yaml` `experiment.name`, then `default`. Each CLI invocation writes one
timestamped run bundle. Each line is a JSON object with one result per test
case, and the run workspace also stores the manifest and related artifacts. Use
this generated run folder as the portable audit surface: copy or sync the run
directory, not a hand-authored parallel bundle.

Each `scores[]` entry includes per-grader timing:

@@ -49,14 +55,18 @@ agentv eval --target my-target evals/**/*.yaml

### Experiment Label

Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):
Tag a run with an experiment name to track different conditions (e.g. with vs without skills):

```bash
agentv pipeline run evals/my-eval.yaml --experiment with_skills
agentv pipeline run evals/my-eval.yaml --experiment without_skills
agentv eval evals/my-eval.yaml --experiment with_skills
agentv eval evals/my-eval.yaml --experiment without_skills
```

The experiment label is written to `manifest.json` and propagated to each entry in `index.jsonl` by `pipeline bench`. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
The experiment label chooses the result bucket and is propagated to each entry
in `index.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval
file. If neither is set, AgentV writes to the `default` bucket. The eval file
stays the same across experiments; what changes is the runtime condition.
Dashboards can filter and compare results by experiment.

### Run Specific Test

@@ -100,7 +110,7 @@ cat ./my-results/index.jsonl

### Generated Task Bundles

Each result can also include a generated task bundle inside its per-test artifact
Each result can also include a generated task bundle inside its per-test result
directory. The bundle captures the eval slice and target settings that produced
that row, so reviewers and rerun tooling can inspect the exact run-local source
instead of relying on a mutable checkout.
@@ -129,14 +139,18 @@ my-results/
```

The `index.jsonl` row links to these generated paths with snake_case fields such
as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
as `result_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
`graders_path`. Treat those paths as relative to the run directory. When you need
a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
share the generated run directory and its `index.jsonl` manifest. Source-side
case directories are still useful for organizing bulky prompts, fixtures, or
tests while authoring an eval, but they are optional input organization rather
than a separate artifact schema.

Use repo-relative `eval_path`, `test_id`, and `target` as the source identity
for a result row. `suite` and `name` are display metadata only; do not use them
to infer storage paths or pick a Dashboard detail row.

If the source eval uses the `PROMPT.md` fallback instead of inline `input`,
AgentV records the generated task bundle metadata when source artifacts are
available. It no longer emits a generated prompt sidecar for result rows.
@@ -346,7 +360,7 @@ agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default/<timestamp

After any failing run, the CLI prints the exact `--rerun-failed` command for the run dir that just completed — copy/paste it. If the process or pod disappeared before you could access the local run directory and results auto-push was enabled, recover the partial run from [WIP checkpoints](/docs/tools/wip-checkpoints/) first, then use the same `--resume` flow.

The interactive wizard (`agentv eval` with no arguments) remembers the last run's artifact directory and surfaces a **"Resume last run"** entry in the main menu when one exists.
The interactive wizard (`agentv eval` with no arguments) remembers the last run directory and surfaces a **"Resume last run"** entry in the main menu when one exists.

### Execution Error Tolerance

@@ -443,7 +457,7 @@ See the [Import tool docs](/docs/tools/import/) for all providers and options.

## Transcript And Result Artifacts

Each result row's `artifact_dir` is a case-local folder under the timestamped
Each result row's `result_dir` is a case-local folder under the timestamped
run bundle. It can include `transcript.jsonl`, `transcript-raw.jsonl`,
`grading.json`, `timing.json`, `metrics.json`, and generated outputs under
`outputs/`. The run root does not contain a mixed transcript artifact; use each
4 changes: 4 additions & 0 deletions docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -9,6 +9,10 @@ Accepted
Supersedes: the 2026-06-23 proposal in this file to separate experiment files
from eval definitions.

Partially superseded by
[ADR 0009](0009-eval-path-result-identity-and-default-experiment.md) for result
experiment bucket precedence, result row identity, and run bundle path naming.

## Context

AgentV needs a stable authoring contract for repo-native evals, run-time knobs,
102 changes: 102 additions & 0 deletions docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -0,0 +1,102 @@
# 9. Use eval_path identity and the default result experiment

Date: 2026-06-27

## Status

Accepted

Supersedes: result naming and storage-routing portions of
[ADR 0006](0006-separate-experiments-from-eval-definitions.md) that derive run
bundle names or per-case artifact paths from eval names, suite names, or wrapper
composition.

## Context

AgentV needs one simple result identity contract that works for direct eval
runs, imported evals, repeated attempts, Dashboard inspection, and downstream
tools that consume portable run bundles.

The previous same-week direction kept `eval.yaml` as the authored experiment
spec, but it still let result buckets and per-case paths be inferred from eval
or suite names. That creates unstable routing when a single CLI invocation runs
multiple eval files, imports suites with overlapping case IDs, or changes
display metadata without changing the task under evaluation.

The final contract keeps authoring and storage separate:

- `eval.yaml` remains the authored experiment spec;
- a CLI invocation produces one timestamped run bundle;
- per-row source identity is stored in `index.jsonl`;
- `suite` and `name` remain display metadata only;
- path discovery comes from the run manifest, not from folder conventions.

## Decision

One AgentV CLI invocation writes one run bundle under:

```text
.agentv/results/<experiment>/<timestamp>/
```

The result experiment bucket is selected in this order:

1. the explicit CLI `--experiment` value;
2. `eval.yaml` `experiment.name`;
3. `default`.

`default` is the canonical bucket when neither the CLI nor the eval file names
an experiment. AgentV does not derive default experiment names from filenames,
suite names, numbers of input eval files, or multi-eval wrapper shapes.
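
The precedence above can be sketched as a small helper. This is an illustrative sketch only, not AgentV's implementation: the function name `resolve_experiment`, its parameter names, and the choice to treat an empty string as unset are all assumptions made for the example.

```python
from typing import Optional

# The canonical bucket named in this ADR.
DEFAULT_EXPERIMENT = "default"


def resolve_experiment(
    cli_experiment: Optional[str],
    eval_experiment_name: Optional[str],
) -> str:
    """Pick the result bucket: CLI value, then eval.yaml experiment.name, then default.

    Hypothetical helper. Treating "" the same as None is an assumption.
    """
    for candidate in (cli_experiment, eval_experiment_name):
        if candidate:
            return candidate
    return DEFAULT_EXPERIMENT
```

Note that nothing about the eval filename, suite name, or number of input files enters the function, which mirrors the rule that the default is never derived from those values.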

`eval.yaml` stays the authored experiment spec. Do not introduce
`experiment.yaml`, `experiments/default.yaml`, or `eval_root` for this pass.

Each row in `index.jsonl` is identified by:

```text
eval_path + test_id + target
```

`eval_path` is the source eval file path relative to the repo root or run
source root. Dashboard and other readers should display this value as `Eval`.
They should also display `test_id` and `target` so users can distinguish rows
with overlapping test IDs.
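
As a rough illustration of the identity rule, a reader could key rows on the three fields and ignore display metadata. The `RowIdentity` type, the `row_identity` helper, and the sample rows are hypothetical; only the field names `eval_path`, `test_id`, `target`, `suite`, and `name` come from this ADR.

```python
from typing import NamedTuple


class RowIdentity(NamedTuple):
    """Source identity of one index.jsonl row (hypothetical type)."""

    eval_path: str
    test_id: str
    target: str


def row_identity(row: dict) -> RowIdentity:
    # Deliberately ignores `suite` and `name`: they are display metadata only.
    return RowIdentity(row["eval_path"], row["test_id"], row["target"])


# Two made-up rows that share a test ID and suite label but come from
# different eval files, so they remain distinct rows.
row_a = {"eval_path": "evals/a.eval.yaml", "test_id": "case-1",
         "target": "my-target", "suite": "smoke", "name": "Case one"}
row_b = {"eval_path": "evals/b.eval.yaml", "test_id": "case-1",
         "target": "my-target", "suite": "smoke", "name": "Case one"}

distinct = row_identity(row_a) != row_identity(row_b)
```

Using a tuple-like key means renaming a suite or display name leaves the identity unchanged, which is the behavior the Consequences section relies on.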

`suite` and `name` are display metadata. They may help humans group or label
results, but they must not drive storage, routing, Dashboard detail selection,
rerun lookup, import identity, or artifact discovery.

`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
`summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
`targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
Consumers must use these fields instead of reconstructing paths from
`suite`, `name`, `test_id`, or `target`.

`result_dir` is an opaque run-local allocation. It should stay readable when
that does not compromise uniqueness, but implementations may suffix or allocate
otherwise to avoid collisions. Its value is not the public identity of the row.
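
A consumer that follows this rule reads paths from the manifest and joins them to the run directory, never rebuilding them from names. The sketch below assumes each `index.jsonl` line is one JSON object and that path fields are run-relative strings; `load_rows`, `resolve_paths`, and the `PATH_FIELDS` subset are hypothetical names chosen for the example.

```python
import json
from pathlib import Path

# A subset of the manifest path fields named in this ADR.
PATH_FIELDS = ("result_dir", "task_dir", "summary_path", "grading_path")


def load_rows(run_dir: Path) -> list[dict]:
    """Read every non-blank line of the run manifest as a JSON object."""
    rows = []
    with (run_dir / "index.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def resolve_paths(run_dir: Path, row: dict) -> dict[str, Path]:
    """Join the row's manifest paths to the run directory.

    Fields missing from the row are skipped rather than guessed from
    suite, name, test_id, or target.
    """
    return {field: run_dir / row[field] for field in PATH_FIELDS if row.get(field)}
```

Because `result_dir` is treated as an opaque string, the same code keeps working if an implementation suffixes or reallocates directories to avoid collisions.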

## Consequences

- A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
produces one timestamped bundle unless the user explicitly runs separate CLI
commands.
- The default no-config path is stable:
`.agentv/results/default/<timestamp>/`.
- Renaming a suite or display name does not move prior results or change
Dashboard routing identity.
- Multiple eval files can share the same `test_id` and suite display name as
long as their `eval_path` values differ.
- Import, rerun, Dashboard, comparison, and export tools can load a run from
`index.jsonl` without needing source checkout conventions.

## Non-Goals

- Defining an `experiment.yaml` artifact.
- Adding `eval_root`.
- Hashing eval paths into default experiment names.
- Creating automatic `multi-eval` experiment names.
- Making `result_dir` a semantic folder contract.
- Removing compatibility readers for older run bundles in this ADR.