From 4ff3d18921e5c3a8954cd8c2dbfca472af8b4109 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 27 Jun 2026 05:32:27 +0200 Subject: [PATCH] fix(skills): default agentv-bench to CLI mode Entire-Checkpoint: da4bdc81f3ac --- skills-data/agentv-bench/SKILL.md | 20 +++++++++---------- .../references/subagent-pipeline.md | 7 ++++--- 2 files changed, 14 insertions(+), 13 deletions(-) diff --git a/skills-data/agentv-bench/SKILL.md b/skills-data/agentv-bench/SKILL.md index 44dff85e9..9d9ddbe66 100644 --- a/skills-data/agentv-bench/SKILL.md +++ b/skills-data/agentv-bench/SKILL.md @@ -129,24 +129,24 @@ Each run produces a new `.agentv/results/default//` directory automat **User instruction takes priority.** If the user says "run in subagent mode", "use subagent mode", or "use CLI mode", use that mode directly. -If the user has not specified a mode, default to `subagent`. +If the user has not specified a mode, default to AgentV CLI mode. -| `AGENT_EVAL_MODE` | Mode | How | -|----------------------|------|-----| -| `subagent` (default) | **Subagent mode** | Subagent-driven eval — parses eval.yaml, spawns executor + grader subagents. Zero CLI dependency. | -| `cli` | **AgentV CLI** | `agentv eval ` — end-to-end, multi-provider | +| Mode | When to use | How | +|------|-------------|-----| +| **AgentV CLI** (default) | Normal eval runs, multi-provider benchmarking, CI, artifact generation, and Dashboard-compatible results. | `agentv eval ` | +| **Subagent mode** | Explicit user request, no usable CLI/runtime, or rare provider/request-cost constraints. | Read `references/subagent-pipeline.md` first. | -Set `AGENT_EVAL_MODE` in `.env` at the project root as the default when no mode is specified. If absent, default to `subagent`. **User instruction always overrides this.** +Do not read `AGENT_EVAL_MODE` or another environment variable to decide the default mode. **User instruction always overrides the default.** -**`subagent`** — Parses eval.yaml directly, spawns executor subagents to run each test case in the current workspace, then spawns grader subagents to evaluate all assertion types natively. No CLI or external API calls required. Read `references/subagent-pipeline.md` for the detailed procedure. +**Subagent mode** — Parses eval.yaml directly, spawns executor subagents to run each test case in the current workspace, then spawns grader subagents to evaluate assertion types natively. This is an opt-in fallback. Read `references/subagent-pipeline.md` for the detailed procedure before using it. -**`cli`** — AgentV CLI handles execution, grading, and artifact generation end-to-end. Works with all providers. Use when you need multi-provider benchmarking or CLI-specific features. +**AgentV CLI mode** — AgentV CLI handles execution, grading, and artifact generation end-to-end. Works with all providers. Use this unless the user explicitly asks for subagent mode or the CLI path is unavailable. ### Running evaluations -**AgentV CLI mode** (end-to-end, EVAL.yaml): +**AgentV CLI mode** (default, end-to-end, EVAL.yaml): ```bash -agentv eval --output .agentv/artifacts/ +agentv eval ``` **Subagent mode** — read `references/subagent-pipeline.md` for the detailed procedure. In brief: use `pipeline input` to extract inputs, dispatch one `executor` subagent per test case (all in parallel), then proceed to grading below. diff --git a/skills-data/agentv-bench/references/subagent-pipeline.md b/skills-data/agentv-bench/references/subagent-pipeline.md index 6d1ba7510..81f09e476 100644 --- a/skills-data/agentv-bench/references/subagent-pipeline.md +++ b/skills-data/agentv-bench/references/subagent-pipeline.md @@ -1,8 +1,9 @@ # Subagent Pipeline — Running eval.yaml without CLI -This reference documents the detailed procedure for running evaluations in subagent mode -(`AGENT_EVAL_MODE=subagent`, the default). The orchestrating skill dispatches `executor` -subagents to perform test cases and `grader` subagents to evaluate outputs. +This reference documents the detailed procedure for running evaluations in subagent mode. +Subagent mode is an explicit opt-in fallback; the main `agentv-bench` flow defaults to +AgentV CLI mode. The orchestrating skill dispatches `executor` subagents to perform test +cases and `grader` subagents to evaluate outputs. Read this reference when executing Step 3 (Run and Grade) in subagent mode.