diff --git a/CONCEPTS.md b/CONCEPTS.md
index 3014bf601..bbeeefdff 100644
--- a/CONCEPTS.md
+++ b/CONCEPTS.md
@@ -16,9 +16,13 @@ Shared domain vocabulary for this project — entities, named processes, and sta
 
 **Experiment** — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.
 
-**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`.
+**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.
 
-**Artifact sidecar** — A file beside or below a test-case artifact directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
+**Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.
+
+**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
+
+**Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
 
 ## Evaluation Reliability
 
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 92b577f9f..f6664b336 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,13 @@ sidebar:
 agentv eval evals/my-eval.yaml
 ```
 
-Results are written to `.agentv/results///index.jsonl`. When no experiment is defined, AgentV uses `.agentv/results/default//index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
+Results are written to `.agentv/results///index.jsonl`.
+AgentV picks the experiment bucket from `--experiment`, then
+`eval.yaml` `experiment.name`, then `default`. Each CLI invocation writes one
+timestamped run bundle. Each line is a JSON object with one result per test
+case, and the run workspace also stores the manifest and related artifacts. Use
+this generated run folder as the portable audit surface: copy or sync the run
+directory, not a hand-authored parallel bundle.
 
 Each `scores[]` entry includes per-grader timing:
 
@@ -49,14 +55,18 @@ agentv eval --target my-target evals/**/*.yaml
 
 ### Experiment Label
 
-Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):
+Tag a run with an experiment name to track different conditions (e.g. with vs without skills):
 
 ```bash
-agentv pipeline run evals/my-eval.yaml --experiment with_skills
-agentv pipeline run evals/my-eval.yaml --experiment without_skills
+agentv eval evals/my-eval.yaml --experiment with_skills
+agentv eval evals/my-eval.yaml --experiment without_skills
 ```
 
-The experiment label is written to `manifest.json` and propagated to each entry in `index.jsonl` by `pipeline bench`.
The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
+The experiment label chooses the result bucket and is propagated to each entry
+in `index.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval
+file. If neither is set, AgentV writes to the `default` bucket. The eval file
+stays the same across experiments; what changes is the runtime condition.
+Dashboards can filter and compare results by experiment.
 
 ### Run Specific Test
 
@@ -100,7 +110,7 @@ cat ./my-results/index.jsonl
 
 ### Generated Task Bundles
 
-Each result can also include a generated task bundle inside its per-test artifact
+Each result can also include a generated task bundle inside its per-test result
 directory. The bundle captures the eval slice and target settings that produced
 that row, so reviewers and rerun tooling can inspect the exact run-local source
 instead of relying on a mutable checkout.
@@ -129,7 +139,7 @@ my-results/
 ```
 
 The `index.jsonl` row links to these generated paths with snake_case fields such
-as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
+as `result_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
 `graders_path`. Treat those paths as relative to the run directory. When you need
 a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
 share the generated run directory and its `index.jsonl` manifest. Source-side
@@ -137,6 +147,10 @@ case directories are still useful for organizing bulky prompts, fixtures, or
 tests while authoring an eval, but they are optional input organization rather
 than a separate artifact schema.
 
+Use repo-relative `eval_path`, `test_id`, and `target` as the source identity
+for a result row. `suite` and `name` are display metadata only; do not use them
+to infer storage paths or pick a Dashboard detail row.
+
 If the source eval uses the `PROMPT.md` fallback instead of inline `input`,
 AgentV records the generated task bundle metadata when source artifacts are
 available. It no longer emits a generated prompt sidecar for result rows.
@@ -346,7 +360,7 @@ agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default///
+```
+
+The result experiment bucket is selected in this order:
+
+1. the explicit CLI `--experiment` value;
+2. `eval.yaml` `experiment.name`;
+3. `default`.
+
+`default` is the canonical bucket when neither the CLI nor the eval file names
+an experiment. AgentV does not derive default experiment names from filenames,
+suite names, numbers of input eval files, or multi-eval wrapper shapes.
+
+`eval.yaml` stays the authored experiment spec. Do not introduce
+`experiment.yaml`, `experiments/default.yaml`, or `eval_root` for this pass.
+
+Each row in `index.jsonl` is identified by:
+
+```text
+eval_path + test_id + target
+```
+
+`eval_path` is the source eval file path relative to the repo root or run
+source root. Dashboard and other readers should display this value as `Eval`.
+They should also display `test_id` and `target` so users can distinguish rows
+with overlapping test IDs.
+
+`suite` and `name` are display metadata. They may help humans group or label
+results, but they must not drive storage, routing, Dashboard detail selection,
+rerun lookup, import identity, or artifact discovery.
+
+`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
+directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
+`summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
+`targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
+Consumers must use these fields instead of reconstructing paths from
+`suite`, `name`, `test_id`, or `target`.
+
+`result_dir` is an opaque run-local allocation.
It should stay readable when
+that does not compromise uniqueness, but implementations may suffix or allocate
+otherwise to avoid collisions. Its value is not the public identity of the row.
+
+## Consequences
+
+- A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
+  produces one timestamped bundle unless the user explicitly runs separate CLI
+  commands.
+- The default no-config path is stable:
+  `.agentv/results/default//`.
+- Renaming a suite or display name does not move prior results or change
+  Dashboard routing identity.
+- Multiple eval files can share the same `test_id` and suite display name as
+  long as their `eval_path` values differ.
+- Import, rerun, Dashboard, comparison, and export tools can load a run from
+  `index.jsonl` without needing source checkout conventions.
+
+## Non-Goals
+
+- Defining an `experiment.yaml` artifact.
+- Adding `eval_root`.
+- Hashing eval paths into default experiment names.
+- Creating automatic `multi-eval` experiment names.
+- Making `result_dir` a semantic folder contract.
+- Removing compatibility readers for older run bundles in this ADR.
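
Reviewer note (not part of the patch): the bucket precedence and row identity rules in the ADR text above can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not AgentV's implementation: the function names `resolve_experiment_bucket`, `row_identity`, and `index_result_dirs` are hypothetical, the sample rows and `result_dir` values are invented for the example, and treating an empty string as "not set" is an assumption the diff does not confirm.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical helper names; none of these come from the AgentV codebase.


def resolve_experiment_bucket(cli_experiment: Optional[str],
                              eval_experiment_name: Optional[str]) -> str:
    """Pick the bucket: CLI --experiment, then eval.yaml experiment.name, then 'default'."""
    # Assumption: an empty string counts as "not set".
    return cli_experiment or eval_experiment_name or "default"


def row_identity(row: dict) -> Tuple[str, str, str]:
    """Source identity is (eval_path, test_id, target); suite and name are ignored."""
    return (row["eval_path"], row["test_id"], row["target"])


def index_result_dirs(index_jsonl_text: str) -> Dict[Tuple[str, str, str], str]:
    """Map each row identity to the result_dir recorded in its manifest row."""
    dirs = {}
    for line in index_jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        # Read result_dir from the row; never rebuild it from suite, name, or test_id.
        dirs[row_identity(row)] = row["result_dir"]
    return dirs


# Invented sample rows: same test_id and suite, different eval_path.
SAMPLE_INDEX = "\n".join([
    json.dumps({"eval_path": "evals/a.eval.yaml", "test_id": "t1", "target": "my-target",
                "suite": "shared", "name": "Case", "result_dir": "results/t1"}),
    json.dumps({"eval_path": "evals/b.eval.yaml", "test_id": "t1", "target": "my-target",
                "suite": "shared", "name": "Case", "result_dir": "results/t1-2"}),
])

result_dirs = index_result_dirs(SAMPLE_INDEX)
```

Keying on the three-field tuple keeps the two sample rows distinct even though they share a `test_id` and suite display name, which matches the "Consequences" bullet about overlapping test IDs. The suffixed `results/t1-2` value is only meant to show that `result_dir` is an allocation to be read, not a path to be derived.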