From a7d5f07d5af1c287ab80d9dd57b3ce077565939c Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Sat, 27 Jun 2026 08:55:03 +0200
Subject: [PATCH 1/2] docs(results): codify eval result identity contract

---
 CONCEPTS.md | 8 +-
 .../docs/docs/evaluation/running-evals.mdx | 28 +++--
 ...arate-experiments-from-eval-definitions.md | 4 +
 ...-result-identity-and-default-experiment.md | 102 ++++++++++++++++++
 4 files changed, 133 insertions(+), 9 deletions(-)
 create mode 100644 docs/adr/0009-eval-path-result-identity-and-default-experiment.md

diff --git a/CONCEPTS.md b/CONCEPTS.md
index 3014bf601..bbeeefdff 100644
--- a/CONCEPTS.md
+++ b/CONCEPTS.md
@@ -16,9 +16,13 @@ Shared domain vocabulary for this project — entities, named processes, and sta

 **Experiment** — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.

-**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`.
+**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.

-**Artifact sidecar** — A file beside or below a test-case artifact directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
+**Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.
+
+**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
+
+**Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.

 ## Evaluation Reliability

diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 92b577f9f..8a869a4ac 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,13 @@ sidebar:
 agentv eval evals/my-eval.yaml
 ```

-Results are written to `.agentv/results///index.jsonl`. When no experiment is defined, AgentV uses `.agentv/results/default//index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
+Results are written to `.agentv/results///index.jsonl`.
+AgentV picks the experiment bucket from `--experiment`, then
+`eval.yaml` `experiment.name`, then `default`. Each CLI invocation writes one
+timestamped run bundle. Each line is a JSON object with one result per test
+case, and the run workspace also stores the manifest and related artifacts. Use
+this generated run folder as the portable audit surface: copy or sync the run
+directory, not a hand-authored parallel bundle.
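Reviewer note: as a rough illustration of the "one JSON object per line" layout described in the paragraph above, the sketch below reads an `index.jsonl` manifest into a list of dictionaries. The helper name `load_index_rows` is hypothetical and is not part of AgentV; the only assumption about the file is that each non-empty line holds one JSON object.

```python
import json
from pathlib import Path


def load_index_rows(index_path):
    """Parse a JSON Lines manifest: one JSON object per non-empty line.

    Hypothetical helper for illustration; not an AgentV API.
    """
    rows = []
    with Path(index_path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between rows
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken manifest is easy to locate.
                raise ValueError(f"{index_path}:{line_number}: invalid JSON") from exc
    return rows
```

Reading the manifest first, rather than walking the run folder, matches the stated intent that `index.jsonl` is the loading contract for a run bundle.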
 Each `scores[]` entry includes per-grader timing:

@@ -49,14 +55,18 @@ agentv eval --target my-target evals/**/*.yaml

 ### Experiment Label

-Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):
+Tag a run with an experiment name to track different conditions (e.g. with vs without skills):

 ```bash
-agentv pipeline run evals/my-eval.yaml --experiment with_skills
-agentv pipeline run evals/my-eval.yaml --experiment without_skills
+agentv eval evals/my-eval.yaml --experiment with_skills
+agentv eval evals/my-eval.yaml --experiment without_skills
 ```

-The experiment label is written to `manifest.json` and propagated to each entry in `index.jsonl` by `pipeline bench`. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
+The experiment label chooses the result bucket and is propagated to each entry
+in `index.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval
+file. If neither is set, AgentV writes to the `default` bucket. The eval file
+stays the same across experiments; what changes is the runtime condition.
+Dashboards can filter and compare results by experiment.

 ### Run Specific Test

@@ -129,7 +139,7 @@ my-results/
 ```

 The `index.jsonl` row links to these generated paths with snake_case fields such
-as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
+as `result_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
 `graders_path`. Treat those paths as relative to the run directory. When you need
 a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
 share the generated run directory and its `index.jsonl` manifest. Source-side
@@ -137,6 +147,10 @@ case directories are still useful for organizing bulky prompts, fixtures, or
 tests while authoring an eval, but they are optional input organization rather
 than a separate artifact schema.
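Reviewer note: the hunk above says manifest paths should be treated as relative to the run directory. A minimal sketch of that rule follows, assuming a row has already been parsed into a dictionary. The helper name `resolve_row_path` and the sample value `cases/0001` are hypothetical, and rejecting paths that escape the run directory is an extra safety assumption, not something the docs require.

```python
from pathlib import Path


def resolve_row_path(run_dir, row, field):
    """Resolve a manifest path field (e.g. "result_dir") against the run directory.

    Hypothetical helper for illustration; not an AgentV API.
    Returns None when the row does not carry the field.
    """
    value = row.get(field)
    if value is None:
        return None
    run_root = Path(run_dir).resolve()
    candidate = (run_root / value).resolve()
    # Assumption: a manifest path should never point outside its own run bundle.
    if candidate != run_root and run_root not in candidate.parents:
        raise ValueError(f"{field}={value!r} escapes the run directory")
    return candidate


# Usage: the path comes from the row itself, never from suite, name, or test_id.
row = {"result_dir": "cases/0001"}  # hypothetical row content
print(resolve_row_path(".agentv/results/default/example-run", row, "result_dir"))
```

The point of the sketch is the direction of lookup: the consumer reads the field from the row and joins it to the run directory, instead of rebuilding a folder name from display metadata.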
+
+Use repo-relative `eval_path`, `test_id`, and `target` as the source identity
+for a result row. `suite` and `name` are display metadata only; do not use them
+to infer storage paths or pick a Dashboard detail row.
+
 If the source eval uses the `PROMPT.md` fallback instead of inline `input`,
 AgentV records the generated task bundle metadata when source artifacts are
 available. It no longer emits a generated prompt sidecar for result rows.
@@ -443,7 +457,7 @@ See the [Import tool docs](/docs/tools/import/) for all providers and options.

 ## Transcript And Result Artifacts

-Each result row's `artifact_dir` is a case-local folder under the timestamped
+Each result row's `result_dir` is a case-local folder under the timestamped
 run bundle. It can include `transcript.jsonl`, `transcript-raw.jsonl`,
 `grading.json`, `timing.json`, `metrics.json`, and generated outputs under
 `outputs/`. The run root does not contain a mixed transcript artifact; use each
diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
index 752495b71..8e441d8bf 100644
--- a/docs/adr/0006-separate-experiments-from-eval-definitions.md
+++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -9,6 +9,10 @@ Accepted

 Supersedes: the 2026-06-23 proposal in this file to separate experiment files
 from eval definitions.

+Partially superseded by
+[ADR 0009](0009-eval-path-result-identity-and-default-experiment.md) for result
+experiment bucket precedence, result row identity, and run bundle path naming.
+
 ## Context

 AgentV needs a stable authoring contract for repo-native evals, run-time knobs,
diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
new file mode 100644
index 000000000..c48fb0702
--- /dev/null
+++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -0,0 +1,102 @@
+# 9. Use eval_path identity and the default result experiment
+
+Date: 2026-06-27
+
+## Status
+
+Accepted
+
+Supersedes: result naming and storage-routing portions of
+[ADR 0006](0006-separate-experiments-from-eval-definitions.md) that derive run
+bundle names or per-case artifact paths from eval names, suite names, or wrapper
+composition.
+
+## Context
+
+AgentV needs one simple result identity contract that works for direct eval
+runs, imported evals, repeated attempts, Dashboard inspection, and downstream
+tools that consume portable run bundles.
+
+The previous same-week direction kept `eval.yaml` as the authored experiment
+spec, but it still let result buckets and per-case paths be inferred from eval
+or suite names. That creates unstable routing when a single CLI invocation runs
+multiple eval files, imports suites with overlapping case IDs, or changes
+display metadata without changing the task under evaluation.
+
+The final contract keeps authoring and storage separate:
+
+- `eval.yaml` remains the authored experiment spec.
+- a CLI invocation produces one timestamped run bundle;
+- per-row source identity is stored in `index.jsonl`;
+- `suite` and `name` remain display metadata only;
+- path discovery comes from the run manifest, not from folder conventions.
+
+## Decision
+
+One AgentV CLI invocation writes one run bundle under:
+
+```text
+.agentv/results///
+```
+
+The result experiment bucket is selected in this order:
+
+1. the explicit CLI `--experiment` value;
+2. `eval.yaml` `experiment.name`;
+3. `default`.
+
+`default` is the canonical bucket when neither the CLI nor the eval file names
+an experiment. AgentV does not derive default experiment names from filenames,
+suite names, numbers of input eval files, or multi-eval wrapper shapes.
+
+`eval.yaml` stays the authored experiment spec. Do not introduce
+`experiment.yaml`, `experiments/default.yaml`, or `eval_root` for this pass.
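Reviewer note: the three-step precedence in the Decision section above can be sketched as a small function. This is an illustration under stated assumptions, not AgentV's implementation: the name `resolve_experiment_bucket` is hypothetical, the eval file is assumed to be already parsed into a dictionary with an optional `experiment.name` entry, and treating an empty string as "unset" is a guess the ADR does not address.

```python
def resolve_experiment_bucket(cli_experiment=None, eval_config=None):
    """Pick the result bucket: CLI --experiment, then experiment.name, then "default".

    Hypothetical helper for illustration; not an AgentV API.
    """
    # 1. An explicit CLI value wins (assumption: empty string counts as unset).
    if cli_experiment:
        return cli_experiment
    # 2. Fall back to experiment.name from the parsed eval.yaml, if present.
    experiment = (eval_config or {}).get("experiment") or {}
    name = experiment.get("name")
    if name:
        return name
    # 3. Otherwise use the canonical bucket; nothing is derived from filenames.
    return "default"


# Usage with the experiment labels from the running-evals docs:
print(resolve_experiment_bucket("with_skills", {"experiment": {"name": "baseline"}}))
print(resolve_experiment_bucket(None, {"experiment": {"name": "baseline"}}))
print(resolve_experiment_bucket())
```

Note that the function takes no filename or suite argument at all, which mirrors the rule that default experiment names are never derived from those inputs.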
+
+Each row in `index.jsonl` is identified by:
+
+```text
+eval_path + test_id + target
+```
+
+`eval_path` is the source eval file path relative to the repo root or run
+source root. Dashboard and other readers should display this value as `Eval`.
+They should also display `test_id` and `target` so users can distinguish rows
+with overlapping test IDs.
+
+`suite` and `name` are display metadata. They may help humans group or label
+results, but they must not drive storage, routing, Dashboard detail selection,
+rerun lookup, import identity, or artifact discovery.
+
+`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
+directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
+`summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
+`targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
+Consumers must use these fields instead of reconstructing paths from
+`suite`, `name`, `test_id`, or `target`.
+
+`result_dir` is an opaque run-local allocation. It should stay readable when
+that does not compromise uniqueness, but implementations may suffix or allocate
+otherwise to avoid collisions. Its value is not the public identity of the row.
+
+## Consequences
+
+- A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
+  produces one timestamped bundle unless the user explicitly runs separate CLI
+  commands.
+- The default no-config path is stable:
+  `.agentv/results/default//`.
+- Renaming a suite or display name does not move prior results or change
+  Dashboard routing identity.
+- Multiple eval files can share the same `test_id` and suite display name as
+  long as their `eval_path` values differ.
+- Import, rerun, Dashboard, comparison, and export tools can load a run from
+  `index.jsonl` without needing source checkout conventions.
+
+## Non-Goals
+
+- Defining an `experiment.yaml` artifact.
+- Adding `eval_root`.
+- Hashing eval paths into default experiment names.
+- Creating automatic `multi-eval` experiment names.
+- Making `result_dir` a semantic folder contract.
+- Removing compatibility readers for older run bundles in this ADR.

From da48f18a579fcb7c3c0b205c993e1b787b91af5c Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Sat, 27 Jun 2026 09:08:25 +0200
Subject: [PATCH 2/2] docs(results): align result directory terminology

---
 apps/web/src/content/docs/docs/evaluation/running-evals.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 8a869a4ac..f6664b336 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -110,7 +110,7 @@ cat ./my-results/index.jsonl

 ### Generated Task Bundles

-Each result can also include a generated task bundle inside its per-test artifact
+Each result can also include a generated task bundle inside its per-test result
 directory. The bundle captures the eval slice and target settings that produced
 that row, so reviewers and rerun tooling can inspect the exact run-local source
 instead of relying on a mutable checkout.
@@ -360,7 +360,7 @@ agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default/