diff --git a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx index cd0f8936a..33ad6bf50 100644 --- a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx +++ b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx @@ -9,7 +9,7 @@ Benchmark suites usually need more than a prompt and a score. They carry source pins, task patches, generated dataset rows, oracle data, setup scripts, and verification commands. AgentV represents that with existing primitives: -- Put runtime behavior in `workspace`, `execution`, `input`, `expected_output`, +- Put runtime behavior in `workspace`, `experiment`, `input`, `expected_output`, and `assertions`. - Put provenance and classification in per-case `metadata`. - Put bulky per-case authoring inputs in optional case directories and supporting files. @@ -29,7 +29,7 @@ Use this split when deciding where a benchmark key belongs: | `workspace.template` | Yes | Copies a workspace template into the run workspace. | | `workspace.hooks` | Yes | Runs lifecycle commands with workspace and case context on stdin. | | `workspace.isolation`, `workspace.mode`, `workspace.path` | Yes | Controls workspace reuse and materialization. | -| `execution` | Yes | Selects targets, thresholds, dependencies, and default grader behavior. | +| `experiment` | Yes | Selects targets, thresholds, repeat policy, budgets, workers, and default grader behavior. | | `input`, `input_files`, `expected_output` | Yes | Builds the target prompt and passive reference answer. | | `assertions` | Yes | Runs deterministic, LLM, composite, or code graders. | | Top-level `name`, `version`, `tags`, `license`, `requires` | Informational | Identifies and categorizes the suite. 
| @@ -162,7 +162,7 @@ workspace: before_each: command: ["python", "./scripts/apply-case-fixtures.py"] -execution: +experiment: targets: [codex, claude] assertions: @@ -182,20 +182,41 @@ the imported results, and link Opik traces when Harbor uploads them. # Proposed runner boundary, not a current AgentV task schema. name: swebench-verified-codex -execution: - runner: harbor - harbor: - dataset: swebench-verified - agent: codex - model: openai/gpt-5-mini - opik: - enabled: true +experiment: + target: codex + model: openai/gpt-5-mini + runner: + type: harbor + options: + opik: + enabled: true ``` Do not translate Harbor `task.toml`, verifier packaging, or suite-specific Docker/Compose adapter fields into AgentV core eval schema. If the benchmark's runtime contract is already owned by Harbor, keep those details in Harbor and let AgentV consume the job metadata, rewards, artifacts, and trace links. +Do not add a generic top-level `source` field just to identify Harbor. If a +future Harbor adapter needs suite selection, keep that selector narrow and +adapter-owned instead of making it the AgentV workspace model. + +## Eval Composition + +When one eval references another eval, preserve the task/runtime split: + +- The parent runnable eval owns runtime `experiment:` for the run. +- Child `experiment:` blocks are ignored by `type: suite` composition. There is + no fallback to the child `experiment:` when the parent has no `experiment:`. +- Child `workspace` setup is preserved for `type: suite` imports. Parent + workspace applies to raw cases owned by the parent file, not to imported suite + tests. +- A tests-only import can drop child workspace context only when the import mode + says so explicitly. +- Workspace path collisions or incompatible isolation settings should fail + loudly if a future explicit remap mode is added. 
+ +That rule keeps imported benchmark cases attached to their setup while still +letting a parent eval compare targets, repeat policy, and gates consistently. ## Finance-Style Generated Dataset diff --git a/docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md b/docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md index 8a13140e7..2647b2750 100644 --- a/docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md +++ b/docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md @@ -11,8 +11,9 @@ Proposed AgentV now has native workspace repository acquisition for custom evals, CI gates, target comparisons, pooled workspaces, hooks, and Docker workspace cases. That should remain generic infrastructure: `workspace.repos[].commit` is the -canonical checkout pin, and `workspace.repos[].base_commit` is only a -SWE-Bench-friendly alias for the same value. +canonical checkout pin. SWE-Bench `base_commit` is upstream/import vocabulary +that adapters may translate into `commit`; it should not become the canonical +AgentV workspace field. Harbor owns benchmark-grade execution for standard suites such as SWE-Bench Verified, Multi-SWE-Bench, Terminal-Bench, and suites with Harbor-specific @@ -39,101 +40,79 @@ Harbor should own: - Harbor `task.toml` files and Harbor YAML config; - Opik trace upload through Harbor when enabled. -## Alignment with experiment separation +## Alignment with inline experiment runtime -The 2026-06-23 experiment/eval separation decision makes runtime binding an -experiment concern. Harbor execution should follow the same split: +The 2026-06-26 inline experiment decision keeps runtime binding inside the +single runnable `eval.yaml` artifact. Harbor execution should follow the same +split: - AgentV eval YAML remains the authoring or selection surface for what benchmark suite is being evaluated. 
-- AgentV experiment YAML selects or pins the Harbor runner, candidate +- The inline `experiment:` block selects or pins the Harbor runner, candidate agent/model, run policy, and other runtime binding. - Harbor-authored YAML remains Harbor's own config surface when the standard suite needs Harbor-specific task packaging or verifier settings. This means the examples below show the desired logical fields, but new -runtime fields should be placed on an experiment unless they are genuinely part -of the benchmark suite identity. Do not put candidate agent/model binding in the -eval file for new AgentV-native examples. +runtime fields should be placed under `experiment:` unless they are genuinely +part of the benchmark suite identity. Do not put candidate agent/model binding +under `source` for new AgentV-native examples. ## Minimal future config surface -An AgentV eval suite can select the benchmark source without copying Harbor's -task schema or claiming to be the runtime runner: +An AgentV eval suite should not gain a generic top-level `source` field just to +select Harbor. ADR 0009 keeps benchmark-shaped evals on existing primitives: +workspace setup belongs in `workspace`, runtime binding belongs in +`experiment`, and imported benchmark provenance belongs in metadata, sidecars, +or adapter manifests. -```yaml -name: swebench-verified - -source: - type: harbor - dataset: swebench-verified -``` - -The corresponding experiment selects how that suite runs: +If a future Harbor adapter needs first-class selection in AgentV YAML, it should +be designed as a narrow runner/import selector after real usage, not as a broad +benchmark source schema. 
The corresponding inline experiment block still selects +how that suite runs: ```yaml name: swebench-verified-codex -target: codex-gpt5-mini -evals: evals/swebench-verified.eval.yaml -runner: - type: harbor - options: - opik: - enabled: true -``` -For a Harbor-authored YAML file, use `config` instead of `dataset`: - -```yaml -source: - type: harbor - config: ./harbor/swebench-verified.yaml +experiment: + target: codex + model: openai/gpt-5-mini + runner: + type: harbor + options: + opik: + enabled: true ``` -The first implementation should accept exactly one Harbor source selector: -`dataset` for a known Harbor dataset id, or `config` for an existing Harbor YAML -file. There should be no precedence rule between them. If both are set, fail -validation and ask the user to choose one. - -Do not combine Harbor suite selection with candidate binding in the eval file: +Do not combine Harbor suite selection with candidate binding in an invented +source block: ```yaml # Avoid in eval.yaml -execution: +source: runner: harbor - harbor: - dataset: swebench-verified - agent: codex - model: openai/gpt-5-mini + dataset: swebench-verified + agent: codex + model: openai/gpt-5-mini ``` -Split that shape across the suite and experiment instead: +Keep runtime binding in `experiment:` instead: ```yaml -# evals/swebench-verified.eval.yaml name: swebench-verified -source: - type: harbor - dataset: swebench-verified -``` -```yaml -# experiments/swebench-verified-codex.yaml -name: swebench-verified-codex -target: codex -model: openai/gpt-5-mini -evals: evals/swebench-verified.eval.yaml -runner: - type: harbor +experiment: + target: codex + model: openai/gpt-5-mini + runner: + type: harbor ``` -Keep Harbor suite source selection under `source` in the eval suite. Keep -experiment-side runner selection under `runner.type`, with runner knobs under -`runner.options`. The eval suite answers "where do these cases come from?"; the -experiment answers "how is this run executed?" 
Do not use `execution.runner` in -new eval-suite examples because that name collides with the experiment runner. -Do not repeat the runner discriminator as `runner.harbor.options`; `type: -harbor` already provides that namespace. +Keep runtime runner selection under `experiment.runner.type`, with runner knobs +under `experiment.runner.options`. Do not use `execution.runner` in new +eval-suite examples because top-level `execution:` is only a legacy alias for +old eval files. Do not repeat the runner discriminator as +`runner.harbor.options`; `type: harbor` already provides that namespace. Do not add top-level AgentV fields for Harbor task packaging, verifier images, task patches, or Docker/Compose adapter settings. If a Harbor option becomes too @@ -153,8 +132,9 @@ agentv eval evals/native.eval.yaml --target codex ``` Harbor-backed evals should use the same top-level entrypoint. If no explicit -experiment runner is configured, AgentV may infer Harbor execution from -`source.type: harbor`: +experiment runner is configured, a future adapter may infer Harbor execution +from an adapter-owned manifest or CLI flag, but this ADR does not add that +schema: ```bash agentv eval evals/swebench-harbor.eval.yaml @@ -176,9 +156,9 @@ agentv results import harbor --job ``` Do not overload native `--target` semantics in the first Harbor runner slice. -Harbor `agent`, `model`, and matrix behavior should come from the experiment or -the referenced Harbor YAML until repeated usage proves a shared AgentV flag is -needed. +Harbor `agent`, `model`, and matrix behavior should come from inline +`experiment:` runtime fields or the referenced Harbor YAML until repeated usage +proves a shared AgentV flag is needed. ## Unsupported fields and non-goals @@ -191,6 +171,8 @@ The Harbor runner mode should not add or interpret: `base_commit` as Harbor runner inputs; - generic `extra_args` or arbitrary pass-through maps in the initial AgentV surface. 
+- generic top-level `source` as a replacement for AgentV `workspace` or + metadata conventions. These fields remain valid in native AgentV evals when authors compose their own workspace, hooks, and graders. They are non-goals only for the Harbor-backed @@ -199,9 +181,9 @@ standard-suite path. ## Implementation sequencing 1. Document the native-vs-Harbor boundary and commit alias rules. -2. Add schema validation for eval-suite `source.type: harbor` and exactly one of - `source.dataset` or `source.config`, plus experiment `runner.type` and - `runner.options`, with no changes to native workspace acquisition. +2. Add a narrow Harbor runner/import selector only after repeated usage proves + it is needed; keep inline `experiment.runner.type` and + `experiment.runner.options`, with no changes to native workspace acquisition. 3. Add a Harbor launch adapter that records job identity and status. 4. Add a Harbor result importer that maps rewards, exceptions, timings, artifacts, and Opik trace URLs into AgentV run bundles. diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md index 752495b71..8e1c84ba4 100644 --- a/docs/adr/0006-separate-experiments-from-eval-definitions.md +++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md @@ -28,6 +28,9 @@ The final design keeps the product boundary smaller: - `experiment:` is an inline run-time block inside `eval.yaml`. - `tests:` is the composition, import, and selection surface. - result bundles are written under `.agentv/results///`. +- A directory named `experiments/` may be used as a user-owned repo convention + for wrapper eval YAML files, but it does not create a separate experiment + artifact type or schema-significant path. 
This keeps AgentV repo-native and zero-infra by default, avoids a new public artifact type, and still lets wrapper evals run multiple imported suites with a @@ -45,7 +48,10 @@ Do not introduce or document: - committed experiment files as the canonical authoring path The only runnable authoring artifact is `eval.yaml` or another `*.eval.yaml` -file. Runtime controls live in an inline `experiment:` block: +file. A project may place wrapper eval files under an `experiments/` directory +when their main job is to bind runtime policy over reusable suites, but those +files are still ordinary eval YAML files. AgentV must not infer behavior from +that directory name. Runtime controls live in an inline `experiment:` block: ```yaml name: cargowise-sql-migration-codex @@ -103,6 +109,28 @@ The old experiment runtime fields are ported into the parent eval file: Suite or case workspace fields remain task-owned when they define what is being evaluated. +## Contract Layers + +Parent-versus-child is not the main composition rule. Contract ownership is: + +| Layer | AgentV fields | Owner in `type: suite` composition | +| --- | --- | --- | +| Task data | `tests`, case `metadata`, `expected_output` | Imported child suite | +| Task prompt | `input`, `input_files`, shared prompt defaults | Imported child suite | +| Task environment | `workspace`, `workspace.repos[]`, templates, workspace hooks | Imported child suite | +| Scoring | `assertions`, graders, expected references | Imported child suite | +| Run policy | `experiment`, CLI target flags, workers, repeat, gates, budget | Parent wrapper eval or CLI | +| Target runtime | selected target config and `targets[].hooks` | Selected target | + +`workspace` can influence what an agent perceives through tools, but it is not +prompt input. `input` is what the agent is told; `workspace` is what the agent +can act on; `assertions` are how AgentV judges the result; `experiment` is how +the run is bound, repeated, compared, and gated. 
+ +This framing removes the apparent "child wins except experiment" exception. +Child suites own task contracts. Parent wrapper evals and CLI flags own run +contracts. + ## Lifecycle Ownership `experiment:` owns evaluation policy, not lifecycle mutation. Commands that @@ -219,6 +247,25 @@ fields. Explicit override syntax can be considered later if a concrete use case needs it, but the default composition model must not merge task contracts in a surprising way. +If a parent eval defines `workspace` and imports child eval suites with +`type: suite`, the parent workspace applies only to raw cases owned by the +parent file. Imported suite tests keep their child suite workspace. This is a +valid mixed-case pattern when the parent owns raw cases, but it is usually a +DX smell when every test is a `type: suite` import. AgentV should warn or lint +that shape rather than silently implying a parent workspace override. + +If a parent eval has no `experiment:` and imports child suites that do have +`experiment:` blocks, child runtime still does not fall back into the parent +run. AgentV should warn because authors often expect the child runtime to be +used. The correct choices are to run the child suite directly, add a parent +`experiment:` block, or pass CLI runtime flags. + +Wrapper evals that import multiple suites with distinct shared workspace +contracts should fail fast or require per-test isolation, separate runs, or an +explicit future composition mode. Shared workspace setup is safe when one suite +owns the task contract; it is not a place for implicit parent-child or +child-child workspace merging. + ## Runtime Overrides The parent `experiment:` block is the default runtime policy for the whole eval. @@ -330,6 +377,13 @@ remain inspectable without reading every manifest row. Test artifacts from tests owned directly by the wrapper eval can still live directly under ``. All cases should also retain source suite metadata in manifests and index rows. 
+The result namespace remains `experiment` in artifacts and Dashboard. AgentV +should not introduce a separate authored `run_group` field. For better DX, +Dashboard and reports may derive display-only runtime source labels such as +`inline experiment`, `CLI`, `defaults`, or `mixed`, and may show the top-level +eval file plus imported suites. Those labels are explanations over existing +primitives, not new configuration surface. + ## Consequences Positive: @@ -350,14 +404,22 @@ Negative: evals. - Explicit task-context override syntax is deferred, so authors who need overrides must create a new suite or wait for a focused override design. +- Wrapper evals need diagnostics so authors understand that parent workspace + does not override imported suite workspaces and child experiment blocks are + ignored. ## Non-Goals -- Do not add separate `experiment.yaml` files or an `experiments/` convention. +- Do not add separate `experiment.yaml` files. +- Do not make `experiments/` a schema-significant directory or separate + artifact type; it may only be a repo layout convention for ordinary wrapper + eval YAML files. - Do not add config pointers to external experiment files. -- Do not present committed experiment files as canonical docs examples. +- Do not present committed non-eval experiment files as canonical docs examples. - Do not make child suite runtime blocks participate in parent wrapper runtime selection. +- Do not add an authored `run_group` field. +- Do not implicitly merge parent and child workspaces for `type: suite` imports. - Do not silently override imported suite task fields from parent suite fields. - Do not encode source suite membership by adding redundant default result path segments. 
diff --git a/docs/adr/0009-keep-benchmark-schema-on-existing-primitives.md b/docs/adr/0009-keep-benchmark-schema-on-existing-primitives.md new file mode 100644 index 000000000..f94ea687f --- /dev/null +++ b/docs/adr/0009-keep-benchmark-schema-on-existing-primitives.md @@ -0,0 +1,177 @@ +# 9. Keep benchmark schema on existing primitives + +Date: 2026-06-27 + +## Status + +Proposed + +## Context + +Research for bead `av-2h9` compared AgentV with SWE-bench, SWE-bench Verified, +Harbor, Margin Evals, Vercel `agent-eval`, OpenAI Evals, Inspect, Braintrust, +promptfoo, LangSmith, Hugging Face Datasets, and OpenInference. + +Those systems converge on stable case identity, dataset splits, repo or fixture +provenance, expected/reference data, executable graders, repeat policy, result +identity, and portable artifacts. They do not converge on a shared runner or +workspace schema. + +AgentV already has the core primitives needed for that lowest-common +denominator: + +- `tests[].id` and `tests[].metadata` for case identity and imported row + provenance; +- `workspace.repos[]`, templates, hooks, and isolation for operational setup; +- inline `experiment:` for target binding, repeat policy, gates, and runtime + knobs; +- `expected_output`, assertions, and code graders for reference data and hidden + verification; +- AgentV run bundles as the artifact source of truth. + +The important AgentV difference is workspace composition. Public coding-agent +benchmarks usually assume one repo, one fixture directory, or one container. +AgentV can materialize multiple repositories into one eval workspace. A new +generic `source` block would either duplicate `workspace.repos[]` or become +non-operational metadata. Neither case justifies more core schema. + +## Decision + +Do not add a new top-level `source` field from this research. Do not rename +`workspace.repos[].commit` to `base_commit`. 
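A minimal sketch of how an import adapter might apply this decision when it translates one SWE-bench row into an ordinary AgentV case. The function name `swebench_row_to_case`, the `url` key inside the repo entry, the metadata key names, and the placement of `workspace` on the case are all hypothetical illustrations, not AgentV's schema. Only `workspace.repos[].commit`, case `id`, case `metadata`, and the SWE-bench field names come from this document.

```python
def swebench_row_to_case(row: dict) -> dict:
    """Translate one SWE-bench row into an AgentV-style case dict (sketch only)."""
    return {
        # Stable case identity comes straight from the upstream instance id.
        "id": row["instance_id"],
        # Only the problem statement is agent-visible input in this sketch.
        "input": row["problem_statement"],
        "workspace": {
            "repos": [
                {
                    # Hypothetical key: how a repo location is spelled is an assumption.
                    "url": f"https://github.com/{row['repo']}",
                    # Upstream `base_commit` is translated into the canonical `commit` pin.
                    "commit": row["base_commit"],
                }
            ]
        },
        "metadata": {
            # Hypothetical provenance keys that preserve upstream vocabulary.
            "source_dataset": "swebench",
            "upstream_base_commit": row["base_commit"],
        },
    }


# Illustrative row with made-up values, not real benchmark data.
example_row = {
    "instance_id": "example__repo-1",
    "repo": "example/repo",
    "base_commit": "abc123",
    "problem_statement": "Fix the failing parser test.",
}
example_case = swebench_row_to_case(example_row)
```

The design point is that `base_commit` never appears as an operational workspace key: it survives only as provenance, while the checkout pin is always `commit`.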
+ +AgentV should continue to model benchmark-shaped evals with existing +primitives: + +- Repository acquisition and checkout pins stay in `workspace.repos[]`. +- `workspace.repos[].commit` remains the canonical checkout ref field. +- `base_commit` is a SWE-bench import or compatibility alias only if an adapter + needs to preserve upstream vocabulary; it should not become the canonical + hand-authored AgentV field. +- Runtime policy stays in inline `experiment:`. +- Target-specific setup remains in target hooks, workspace hooks, assertions, + and code graders. +- Benchmark row details stay in `tests[].metadata`, source-owned sidecars, or + adapter-generated manifests. +- External benchmark runners such as Harbor stay behind runner/import + boundaries. + +The same minimal-primitives rule applies to experiment display and repo layout. +AgentV should not add an authored `run_group` field or revive separate +`experiment.yaml` files. The existing `experiment` namespace remains the +artifact and Dashboard grouping primitive. Projects may use an `experiments/` +directory for wrapper eval YAML files, but that directory is only a +documentation and repo-organization convention. AgentV must not infer behavior +from the folder name. + +For eval composition, the parent runnable eval owns runtime policy. If a parent +references child eval files with `type: suite`, the current loader ignores the +child `experiment:` block and uses the parent `experiment:` when one exists; it +does not fall back to the child `experiment:`. Workspace follows task ownership, +not runtime fallback: imported child tests keep the child suite workspace that +was already expanded into those tests, while the parent workspace applies to +raw cases owned by the parent file. A "tests only" import mode may drop child +workspace context, but that must be opt-in. 
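The ownership rule above can be sketched as a small resolver. This is an illustration under stated assumptions, not the actual loader: the function names `resolve_suite_import` and `resolve_raw_case` are hypothetical, and eval files are modelled as plain dicts with only `experiment` and `workspace` keys.

```python
def resolve_suite_import(parent: dict, child: dict) -> dict:
    """Effective contract for tests imported from a child eval via `type: suite`."""
    return {
        # Run policy comes from the parent only. The child `experiment` is
        # ignored, and there is no fallback to it when the parent has none.
        "experiment": parent.get("experiment"),
        # Task environment stays with the imported child suite.
        "workspace": child.get("workspace"),
    }


def resolve_raw_case(parent: dict) -> dict:
    """Effective contract for raw cases owned directly by the parent file."""
    return {
        "experiment": parent.get("experiment"),
        "workspace": parent.get("workspace"),
    }


# Hypothetical parent wrapper and child suite, reduced to the two layers.
parent_eval = {
    "experiment": {"targets": ["codex", "claude"]},
    "workspace": {"template": "./parent-template"},
}
child_eval = {
    "experiment": {"targets": ["child-only-target"]},
    "workspace": {"template": "./child-template"},
}

imported = resolve_suite_import(parent_eval, child_eval)
raw = resolve_raw_case(parent_eval)
no_runtime = resolve_suite_import({}, child_eval)
```

In this sketch a parent without `experiment:` resolves to `None` rather than borrowing the child block, which is the case where the ADR expects a warning and a CLI flag or an explicit parent block.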
+ +ADR 0006 defines the contract-layer model behind this rule: task data, task +prompt, task environment, and scoring come from the imported child suite; run +policy comes from the parent wrapper eval or CLI. `workspace` is task +environment, not prompt input, even though agents may inspect it through tools. + +If a future composition feature allows parent workspace override or remapping +for imported suites, it should be explicit and logged. The default should not +silently replace child workspace setup, because that setup is part of the +imported cases' validity. + +This decision creates follow-up behavior and docs beads: + +- `av-pkp` adds authoring diagnostics for misleading wrapper composition, + including parent workspace with only suite imports and ignored child + experiments. +- `av-ha5` guards incompatible imported-suite shared workspace compositions so + one wrapper run cannot silently use the wrong shared workspace. +- `av-82t` improves Dashboard/report display of the existing experiment + namespace and derived runtime source without adding new authored primitives. +- `av-58q` teaches the optional `evals/suites/` and `experiments/` + wrapper-eval folder convention without making the path schema-significant. + +## Consequences + +Positive: + +- AgentV avoids duplicating existing workspace and metadata concepts. +- The multi-repo workspace contract stays a product differentiator instead of + being collapsed into single-source benchmark vocabulary. +- SWE-bench, Harbor, Margin, promptfoo, Braintrust, LangSmith, OpenAI Evals, + Inspect, and Hugging Face mappings can remain adapters or docs recipes that + emit ordinary AgentV evals. +- Existing authoring concepts remain stable: workspace for setup, experiment + for runtime, tests for cases, metadata for source row details, run bundles for + audit. +- The `commit` field stays self-evident inside `workspace.repos[]`. 
+- Dashboard and reports can become clearer by explaining runtime source over + existing artifacts instead of adding configuration surface. + +Negative: + +- AgentV still needs strong docs examples so authors do not invent competing + provenance keys. +- Import/composition behavior needs a focused follow-up if parent evals include + child evals with conflicting workspaces. +- Some imported benchmark vocabulary such as SWE-bench `base_commit` must be + translated at the adapter boundary. +- Diagnostics are needed because the one-primitive model puts task suites and + wrapper experiments in the same file format. + +## Alternatives Considered + +- **Add top-level `source`.** Rejected. If it performs repo acquisition, it + conflicts with `workspace.repos[]`; if it is informational, it duplicates + metadata and sidecar manifests. +- **Use `source` for Harbor suite selection.** Rejected for core schema in this + decision. Harbor-backed execution should remain a runner/import boundary + until repeated usage proves a small AgentV selector is necessary. +- **Rename `commit` to `base_commit`.** Rejected. `base_commit` is useful + SWE-bench vocabulary, but `workspace.repos[].commit` is already scoped to a + checkout and works for branches, tags, SHAs, and non-SWE benchmarks. +- **Drop child workspaces when importing child evals.** Rejected as a default. + That turns valid imported cases into tests detached from their setup. +- **Copy benchmark-specific fields into AgentV.** Rejected. SWE-bench patches, + Harbor task TOML, Margin suite config, promptfoo provider matrices, and + Braintrust hosted experiment fields stay in adapters, fixtures, metadata, or + source-owned files. +- **Add authored `run_group`.** Rejected. The existing `experiment` namespace is + enough for artifact grouping. Runtime source should be derived for display, + not configured as another primitive. +- **Revive separate experiment artifacts.** Rejected. 
Wrapper experiments are + ordinary eval YAML files with inline `experiment:` blocks. +- **Make `experiments/` schema-significant.** Rejected. The folder may be a + user-owned repo layout convention, but AgentV should not infer semantics from + it. +- **Implicitly merge parent and child workspaces.** Rejected for now. Hook + order, repo path conflicts, isolation mode conflicts, and reset policies make + implicit merge too surprising. A future merge/override mode must be explicit + if real usage justifies it. + +## Non-Goals + +- Implementing schema changes in this ADR. +- Defining a benchmark catalog. +- Adding authored `run_group` or separate `experiment.yaml` primitives. +- Making `experiments/` schema-significant rather than a plain repo layout + convention for wrapper eval YAML. +- Implicitly merging parent and imported child workspaces. +- Rebuilding hosted experiment stores such as Braintrust or LangSmith. +- Making Harbor task packaging, verifier images, or Compose adapters + AgentV-native schema. +- Making Phoenix, OpenInference, or any trace backend the AgentV artifact owner. 
+ +## References + +- Research artifact: [docs/plans/2026-06-27-001-docs-agentv-schema-benchmark-research-plan.md](../plans/2026-06-27-001-docs-agentv-schema-benchmark-research-plan.md) +- Strategy: [STRATEGY.md](../../STRATEGY.md) +- Roadmap: [ROADMAP.md](../../ROADMAP.md) +- Product boundary: [.agents/product-boundary.md](../../.agents/product-boundary.md) +- Technical conventions: [.agents/conventions.md](../../.agents/conventions.md) +- Harbor boundary: [docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md](0002-keep-harbor-benchmark-execution-behind-runner-boundary.md) +- Inline experiment decision: [docs/adr/0006-separate-experiments-from-eval-definitions.md](0006-separate-experiments-from-eval-definitions.md) diff --git a/docs/plans/2026-06-27-001-docs-agentv-schema-benchmark-research-plan.md b/docs/plans/2026-06-27-001-docs-agentv-schema-benchmark-research-plan.md new file mode 100644 index 000000000..4a2e89f1f --- /dev/null +++ b/docs/plans/2026-06-27-001-docs-agentv-schema-benchmark-research-plan.md @@ -0,0 +1,314 @@ +--- +title: "AgentV Schema Benchmark Research - Plan" +type: docs +date: 2026-06-27 +topic: agentv-schema-benchmark-research +artifact_contract: ce-unified-plan/v1 +artifact_readiness: requirements-only +product_contract_source: ce-brainstorm +execution: code +bead: av-2h9 +--- + +# AgentV Schema Benchmark Research - Plan + +## Goal Capsule + +- **Objective:** Frame benchmark-informed requirements for AgentV eval schema + authoring without implementing schema changes in this bead. +- **Product authority:** `STRATEGY.md`, `ROADMAP.md`, + `.agents/product-boundary.md`, `CONCEPTS.md`, + `docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md`, + `docs/adr/0006-separate-experiments-from-eval-definitions.md`, and + `docs/adr/0009-keep-benchmark-schema-on-existing-primitives.md`. +- **Decision summary:** Mature benchmark systems validate AgentV's existing + primitives more than they justify new schema. 
Keep using `workspace`, + `experiment`, target/workspace hooks, assertions, code graders, + `tests[].metadata`, and run bundles. Do not add a top-level `source` field + from this research. Do not rename `workspace.repos[].commit` to + `base_commit`. + +--- + +## Product Contract + +### Summary + +AgentV should make benchmark-shaped evals easier to author by documenting and +validating the existing lowest-common-denominator concepts: stable case ids, +multi-repo workspace setup, explicit runtime policy, expected/reference data, +hidden executable graders, repeat policy, result identity, and provenance in +metadata or sidecars. + +The research did not find a mature external framework concept that AgentV needs +to absorb into core schema now. The main correction is product framing: AgentV's +multi-repo `workspace` is stronger than benchmark `source` vocabulary. A generic +`source` field would either duplicate operational repo setup or become metadata +with a special name. + +### Key Decisions + +- **No new core source field.** Do not add top-level `source` for this bead's + findings. Use existing case metadata, source-owned sidecars, adapter + manifests, and generated run artifacts for provenance. +- **Keep `workspace.repos[]` operational.** Repository acquisition, checkout + refs, templates, hooks, and isolation stay under `workspace`, including + multi-repo cases. +- **Keep `commit` canonical.** `workspace.repos[].commit` remains the + self-evident checkout pin. `base_commit` is only an upstream SWE-bench import + term or compatibility alias if an adapter needs it. +- **Keep inline `experiment:` canonical.** Runtime binding, targets, repeat + policy, budgets, gates, and runner knobs stay in `experiment:`. +- **Keep external frameworks at boundaries.** Harbor, Margin, Braintrust, + LangSmith, promptfoo, OpenAI Evals, Inspect, Hugging Face Datasets, and + OpenInference inform adapters and docs, not AgentV-native object models. 
+- **Make composition explicit.** When a parent eval references child eval files + with `type: suite`, the current loader uses the parent `experiment:` and does + not fall back to the child `experiment:`. Child workspace remains task-owned: + imported suite tests keep their expanded child workspace, while parent + workspace applies to raw cases owned by the parent file. Any future parent + workspace override/remap should be explicit and logged. + +### Evidence Summary + +Mature systems share the same conceptual spine, even when their concrete +formats differ: + +| System | Useful observed concept | AgentV implication | +| --- | --- | --- | +| SWE-bench | Instances have `instance_id`, `repo`, `base_commit`, `problem_statement`, `patch`, `test_patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`, and split-specific variants such as Verified difficulty. | Translate upstream ids and `base_commit` at import time; keep operational checkout refs in `workspace.repos[].commit` and hidden verifier data out of agent-visible input. | +| SWE-bench Verified | Human validation filters underspecified issues and unfair tests; annotations include difficulty and quality labels. | Preserve split, revision, and quality labels in metadata or manifests so results can be sliced without adding benchmark-specific fields. | +| Harbor | A task is instruction plus container environment plus test script; a dataset is a collection of tasks; a trial is an attempt; a job is a collection of trials. | Keep Harbor as a runner/import boundary and result source; do not copy Harbor task packaging into AgentV schema. | +| Harbor task format | `task.toml`, `instruction.md`, `environment/`, `solution/`, and `tests/` separate task metadata, environment, oracle solution, and verifier. | AgentV can model the same separation with workspaces, hooks, fixtures, expected output, and code graders. 
| +| Margin Lab evals | Suite directories use `suite.toml`, `case.toml`, `prompt.md`, `tests/test.sh`, optional `env/Dockerfile`, optional `oracle/solve.sh`, remote suite pins, resume semantics, and immutable run bundles. | Margin is the likely intended "Margin evals" reference; borrow hidden verifier/oracle separation and run-bundle discipline, not its config dialect. | +| Vercel `agent-eval` | Fixture directories combine `PROMPT.md`, hidden `EVAL.ts`, project files, experiment configs with `runs`, `earlyExit`, scripts, sandbox selection, and per-run transcript artifacts. | AgentV already has equivalent roles through workspace fixtures, code graders, `experiment.repeat`, gates, and run artifacts. | +| OpenAI Evals | Eval construction is dataset JSONL plus eval class/template registration; names encode eval, split, and version. | Importers should preserve dataset/split/version identity without adding a registry model. | +| Inspect | A `Task` combines dataset, solver, scorer, tools/agents, and optional sandboxing; Inspect Evals register entries include source repository URL and pinned commit metadata. | AgentV maps task/runtime/scorer concepts cleanly; source repo pins belong in `workspace.repos[]` only when AgentV materializes them. | +| Braintrust | Evaluations are data, task, and scores; experiments are immutable comparable records. | AgentV run bundles remain the immutable comparable record; no hosted backend required. | +| promptfoo | YAML combines prompts, providers, tests, assertions, imported test files, `defaultTest`, and matrix expansion. | Borrow import/default clarity, but keep repo-native target comparison in `experiment:` instead of prompt/provider matrices. | +| LangSmith | Offline evals use datasets/examples/reference outputs; experiments capture outputs, scores, and traces; online evals target runs/threads without references. | Keep expected/reference data distinct from run/trace evaluation and keep production monitoring out of core. 
| +| Hugging Face Datasets | Features, splits, typed columns, and dataset cards make corpus shape and provenance explicit. | Preserve corpus identity and columns in metadata/manifests when importing, without depending on Arrow or the Hub. | +| OpenInference | Span kinds and attributes standardize LLM, agent, tool, evaluator, token, and cost trace semantics. | Align trace metadata names where useful, but keep OpenInference as an observability/export boundary. | +| AgentV workspaces | AgentV can materialize multiple repositories into one eval workspace through `workspace.repos[]`. | Treat this as a differentiator to preserve; no surveyed benchmark framework provides a better core workspace model. | + +Research ambiguity: + +- "Harbor/Harbour" appears to refer to Harbor Framework; searches did not find + a separate primary "Harbour" eval framework using the British spelling. +- "Margin evals" most plausibly refers to Margin Lab's `Margin-Lab/evals` and + Marginlab public tracker work. No separate primary "Margin evals" standard + was identified. +- DeepWiki was used as secondary repo-orientation support for + `vercel-labs/agent-eval`, `openai/evals`, and + `UKGovernmentBEIS/inspect_evals`; primary claims were checked against official + docs, cloned repositories, or dataset cards. + +## Requirements + +### Existing Primitives + +- R1. AgentV should continue to represent operational repository setup through + `workspace.repos[]`, not through a generic source selector. +- R2. `workspace.repos[].commit` should remain canonical. `base_commit` should + be treated as upstream/import vocabulary, not a canonical AgentV rename. +- R3. Runtime policy should stay in inline `experiment:`: targets, workers, + budgets, repeat policy, gates, sandbox/runner knobs, and early-exit behavior. +- R4. Target and workspace hooks should remain the extension point for + harness-specific setup that external frameworks encode in their own runner + configs. +- R5.
`expected_output`, assertions, and code graders should remain distinct: + passive reference data, executable scoring, and hidden verification should + not be collapsed. + +### Provenance + +- R6. Imported benchmark source identity should be represented with existing + `tests[].metadata`, source-owned sidecars, adapter manifests, and generated + run artifacts. +- R7. AgentV docs should recommend stable metadata keys for common imported + facts such as source benchmark id, split, revision, upstream row id, repo URL, + and curation labels, without making those keys new core schema. +- R8. Suite-level provenance should not require a new top-level field in this + research. If a future bead adds suite-level metadata, it should do so as a + general metadata capability, not as a benchmark-specific `source` block. +- R9. Hidden benchmark data such as SWE-bench `test_patch`, `FAIL_TO_PASS`, and + oracle files should stay in metadata, sidecars, fixtures, or code graders and + should not become agent-visible input by default. +- R10. Run artifacts should preserve enough compact provenance for audit, + comparison, filtering, and rerun without bloating every result row. + +### Composition and Imports + +- R11. Parent evals that reference child evals should own the runtime + `experiment:` for the parent run. +- R12. Child `experiment:` blocks should be ignored by parent `type: suite` + composition, even when the parent has no `experiment:`; there is currently no + child-experiment fallback. +- R13. Child `workspace` setup should remain task-owned. In the current loader, + imported suite tests keep their child workspace, and parent workspace applies + only to parent raw cases. +- R14. Parent workspace override/remap for imported suites should require an + explicit future syntax and should emit an info log explaining which workspace + is being used. +- R15. A tests-only import mode may drop child workspace context, but it must be + explicit because it changes case validity. +- R16. 
Workspace merge conflicts, path collisions, and incompatible isolation + settings should fail loudly rather than producing ambiguous setup. + +### Adapter Boundaries + +- R17. Harbor-backed execution should remain a runner/import boundary as + described in ADR 0002, with Harbor-owned task packaging and verifier details + outside AgentV core. +- R18. Margin, promptfoo, Braintrust, LangSmith, OpenAI Evals, Inspect, and + Hugging Face mappings should start as import/export adapters, examples, or + docs recipes. +- R19. Adapter output should prefer ordinary AgentV YAML plus sidecars over + pass-through maps, so humans and AI agents can inspect the generated evals. +- R20. Phoenix, OpenInference, Opik, Braintrust, and LangSmith links should stay + correlation/export metadata; AgentV run bundles remain the source of truth. + +## Recommended Schema Directions + +1. **Make no schema change from this research.** The benchmark comparison + supports AgentV's current primitives more than it supports new fields. +2. **Do not add top-level `source`.** It is redundant with either + `workspace.repos[]` or existing metadata/manifest patterns. +3. **Do not rename `commit`.** Keep `workspace.repos[].commit`; translate + SWE-bench `base_commit` at adapter boundaries when needed. +4. **Document composition semantics before implementing new imports.** Parent + evals own runtime `experiment:` without child fallback. Child workspaces are + preserved for `type: suite`; parent workspace applies to parent-owned raw + cases. Any future override/remap needs explicit syntax and an info log. +5. **Canonicalize docs toward `experiment:`.** Existing examples that still + teach `execution:` should be audited in a follow-up docs bead if that surface + is still transitional. +6.
**Write benchmark recipes using current primitives.** SWE-style native cases, + Harbor-backed runs, Margin-style hidden verifiers, promptfoo-style test + imports, and Braintrust/LangSmith data rows can all be described without new + core schema. + +## Explicit Non-Goals + +- Do not implement schema changes in this bead. +- Do not add a top-level `source` field from this research. +- Do not rename `workspace.repos[].commit` to `base_commit`. +- Do not copy SWE-bench `patch`, `test_patch`, `FAIL_TO_PASS`, or + `PASS_TO_PASS` into AgentV top-level schema. +- Do not make Harbor `task.toml`, Docker network policy, verifier environment, + registry, or reward-file format AgentV-native schema. +- Do not make Margin Lab suite, agent-config, or eval-config files an AgentV + config dialect. +- Do not rebuild Braintrust, LangSmith, promptfoo, OpenAI Evals, Inspect, or + Hugging Face dataset registries inside AgentV. +- Do not make Phoenix, OpenInference, hosted traces, or hosted experiments the + AgentV artifact source of truth. + +## Compatibility and Migration Risks + +- `execution:` appears in existing examples while ADR 0006 says `experiment:` + is canonical. A follow-up docs/schema audit should decide whether this is + legacy compatibility or next-tag cleanup. +- `repeat` and `runs` both appear in some external or local vocabulary. AgentV + should keep `repeat` canonical unless a compatibility story requires aliases. +- Silently replacing child workspaces during eval composition can create false + failures or false passes. Composition needs explicit modes, info logs when an + override/remap is requested, and loud collision handling. +- Translating imported `base_commit` into `workspace.repos[].commit` may surprise + SWE-bench users unless docs show the mapping directly. +- Provenance in free-form metadata can drift across adapters. Docs should + recommend a small set of conventional keys even if core schema remains small. + +## Open Questions + +- OQ1. 
Which docs/examples should be hard-corrected from `execution:` to + `experiment:` before the next tag? +- OQ2. Should AgentV eventually support a formal suite-level `metadata` field, + and if so, should it be general-purpose rather than benchmark-specific? +- OQ3. Should AgentV add an info log for current `type: suite` imports when a + parent workspace exists, explaining that imported child tests keep child + workspace while parent workspace applies only to parent raw cases? +- OQ4. What exact composition syntax should distinguish full-suite include from + tests-only import and any future explicit workspace override/remap? +- OQ5. When multiple child evals provide `workspace.repos[]`, should path + collisions fail unconditionally, or should an explicit parent remap be + allowed? +- OQ6. Which public recipe should come first: native SWE-style task, + Harbor-backed standard suite, Margin-style hidden verifier, or promptfoo-style + test import? + +## Suggested Follow-Up Beads + +- `docs(schema): canonicalize eval runtime docs and examples` - Audit + `execution:` versus `experiment:`, `runs` versus `repeat`, and AI-facing + eval-builder references. +- `design(schema): eval composition semantics` - Define full-suite include, + tests-only import, current parent/child workspace ownership, optional info + logs, future workspace merge/remap, collision errors, and parent `experiment:` + override behavior. +- `docs(evals): benchmark authoring recipes` - Add human and AI docs for + SWE-bench-style, Harbor-backed, Margin-style, promptfoo-style, and + Braintrust/LangSmith-style mappings using existing AgentV primitives. +- `adapter(research): import provenance smoke fixtures` - Convert a tiny + SWE-bench Verified row and a promptfoo config into ordinary AgentV YAML to + validate docs before implementation. + +## Sources / Research + +- AgentV strategy and roadmap: `STRATEGY.md`, `ROADMAP.md`. 
+- AgentV product boundary and conventions: `.agents/product-boundary.md`, + `.agents/conventions.md`. +- AgentV vocabulary: `CONCEPTS.md`. +- Harbor runner boundary: + `docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md`. +- Inline experiment decision: + `docs/adr/0006-separate-experiments-from-eval-definitions.md`. +- Benchmark schema decision: + `docs/adr/0009-keep-benchmark-schema-on-existing-primitives.md`. +- Current AgentV eval schema: + `packages/core/src/evaluation/validation/eval-file.schema.ts`, + `packages/core/src/evaluation/experiment.ts`, + `packages/core/src/evaluation/result-row-schema.ts`, + `packages/core/src/evaluation/run-artifacts.ts`. +- AI research wiki synthesis: `concepts/benchmark-provenance-workspace-patterns.md`, + `concepts/minimal-eval-definition-schema.md`, `entities/swe-bench.md`, + `entities/margin-evals.md`, `entities/vercel-agent-eval.md` in + `tsoyang-org/ai-research-wiki`. +- SWE-bench dataset guide: + https://www.swebench.com/SWE-bench/guides/datasets/ +- SWE-bench Verified announcement: + https://openai.com/index/introducing-swe-bench-verified/ +- SWE-bench Verified dataset card: + https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified +- Harbor core concepts: https://www.harborframework.com/docs/core-concepts +- Harbor task structure: https://www.harborframework.com/docs/tasks +- Harbor adapters guide: + https://www.harborframework.com/docs/datasets/adapters +- Margin Lab evals repository: https://github.com/Margin-Lab/evals +- Vercel `agent-eval`: https://github.com/vercel-labs/agent-eval +- Vercel `agent-eval` source checks: + `packages/agent-eval/src/lib/types.ts` and + `packages/agent-eval/src/lib/results.ts` in + https://github.com/vercel-labs/agent-eval +- OpenAI Evals build guide: + https://github.com/openai/evals/blob/main/docs/build-eval.md +- Inspect AI docs: https://inspect.aisi.org.uk/ +- Inspect Evals registry: https://ukgovernmentbeis.github.io/inspect_evals/ +- Inspect Evals register 
schema example: + https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/register/example_eval.yaml +- Braintrust evaluation docs: https://www.braintrust.dev/docs/evaluate +- Braintrust evaluation quickstart: + https://www.braintrust.dev/docs/evaluation-quickstart +- promptfoo configuration docs: + https://www.promptfoo.dev/docs/configuration/guide/ +- promptfoo test case docs: + https://www.promptfoo.dev/docs/configuration/test-cases/ +- LangSmith evaluation docs: https://docs.langchain.com/langsmith/evaluation +- LangSmith evaluation concepts: + https://docs.langchain.com/langsmith/evaluation-concepts +- Hugging Face dataset features: + https://huggingface.co/docs/datasets/en/about_dataset_features +- Hugging Face dataset cards: https://huggingface.co/docs/hub/en/datasets-cards +- OpenInference specification: https://arize-ai.github.io/openinference/spec/ +- OpenInference semantic conventions: + https://arize-ai.github.io/openinference/spec/semantic_conventions.html