Test AI targets on real repo tasks and measure what actually works.
- Local-first — runs on your machine, no cloud accounts or API keys for eval infrastructure
- Repo-backed workspaces — reuse real repos, setup scripts, and existing harnesses instead of rebuilding synthetic tasks
- Portable artifacts — results, traces, and reports are saved in a durable format other tools can consume
- Version-controlled — evals, judges, and results all live in Git
- Hybrid graders — deterministic code checks + LLM-based subjective scoring
- CI/CD native — exit codes, JSONL output, threshold flags for pipeline gating
- Any target — run against agents, model providers, gateways, replay targets, CLI wrappers, transcript providers, and future app or service wrappers
- Eval suite / imports / tests are the task corpus: the prompts, cases, datasets, and imported benchmarks you want to evaluate.
- Category is derived from where the eval lives, such as folder path and file name. Use paths to organize the corpus instead of repeating category labels in every eval.
- Workspace / fixtures / graders are task-owned context: repos, setup scripts, files, fixtures, isolation, deterministic checks, and LLM grading prompts.
- Target is the system under test: an agent, provider, gateway, replay target, CLI wrapper, transcript provider, or future app/service wrapper. Use `model` when you need to override the target's default model for a run.
- Experiment is the named condition being measured over that corpus, such as `backend-with-skills` or `backend-without-skills`.
- Policy controls how AgentV executes and gates the eval: runs, thresholds, timeouts, and budgets. It is not the experiment identity.
- Run is one concrete execution of an experiment against a target/model that writes portable artifacts for readers such as Dashboard, compare, and trend.
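As a rough illustration of path-derived categories, the sketch below maps an eval file path to a category string. The helper name `deriveCategory`, the `evals` root default, and the `uncategorized` fallback are all hypothetical, and the sketch uses only the folder path. AgentV's actual derivation rule may differ, for instance by also using the file name.

```typescript
// Hypothetical helper, not AgentV's implementation: derives a category
// from the directories between the evals root and the eval file.
export function deriveCategory(evalFile: string, evalsRoot = 'evals'): string {
  // Split on forward slashes and drop empty and "." segments.
  const segments = evalFile.split('/').filter((s) => s.length > 0 && s !== '.');
  // If the root is not found, indexOf returns -1 and every directory is kept.
  const rootIndex = segments.indexOf(evalsRoot);
  // Keep directories after the root and drop the file name itself.
  const dirs = segments.slice(rootIndex + 1, -1);
  return dirs.length > 0 ? dirs.join('/') : 'uncategorized';
}
```

Under these assumptions, an eval stored at `evals/backend/auth/login.yaml` would be grouped under `backend/auth` without a category label in the file itself.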
flowchart LR
corpus["Eval suite / imports / tests<br/>task corpus"]
category["Category<br/>path-derived grouping"]
context["Workspace / fixtures / graders<br/>task-owned context"]
experiment["Experiment<br/>named run condition"]
target["Target + model<br/>system under test"]
policy["Policy<br/>execution + gates"]
run["Run<br/>concrete execution"]
artifacts["Run artifacts<br/>summary.json + index.jsonl + sidecars"]
readers["Dashboard / compare / trend<br/>derived readers"]
corpus --> category
corpus --> run
context --> run
category --> run
experiment --> run
target --> run
policy --> run
run --> artifacts
artifacts --> readers
1. Install and initialize:
npm install -g agentv
agentv init
2. Configure targets in .agentv/targets.yaml: point to the system under test, such as an agent, provider, gateway, replay source, or CLI wrapper.
3. Create an eval in evals/:
experiment: backend-with-skills
description: Code generation quality
target: copilot-sdk
model: claude-sonnet-4.6
workspace:
  isolation: per_case
policy:
  runs: 3
  early_exit: false
  timeout_seconds: 600
  threshold: 0.8
  budget_usd: 5
tests:
  - id: fizzbuzz
    input: Write FizzBuzz in Python
    assertions:
      - type: contains
        value: "fizz"
      - Implements correct FizzBuzz logic for multiples of 3, 5, and 15
      - type: code-grader
        command: ["python3", "./validators/check_syntax.py"]
      - type: llm-grader
        prompt: ./graders/correctness.md
4. Run it:
agentv eval evals/my-eval.yaml
5. Compare two runs (pass two index.jsonl manifests, e.g. before and after a change):
agentv compare .agentv/results/backend-without-skills/<timestamp>/copilot-sdk--claude-sonnet-4.6/index.jsonl .agentv/results/backend-with-skills/<timestamp>/copilot-sdk--claude-sonnet-4.6/index.jsonl
Each run writes a timestamped invocation directory under `.agentv/results/<experiment>/<timestamp>/`. In this example, `experiment: backend-with-skills` names the condition being measured, `target: copilot-sdk` selects the system under test, and `model: claude-sonnet-4.6` overrides that target's default model. The resolved target identity is still `copilot-sdk--claude-sonnet-4.6`, so CI baselines can distinguish model changes. The flat `index.jsonl` manifest is the portable surface used by scripts, CI, and `agentv compare`:
agentv eval evals/my-eval.yaml
cat .agentv/results/backend-with-skills/<timestamp>/copilot-sdk--claude-sonnet-4.6/index.jsonl
Run bundle layout:
.agentv/results/
└── backend-with-skills/                        # <experiment>: comparison/run grouping
    └── 2026-06-30T08-30-00-000Z/               # <timestamp>: one run
        └── copilot-sdk--claude-sonnet-4.6/     # <target>: resolved system under test
            ├── index.jsonl                     # flat per-test results (scripts/CI, `agentv compare`)
            ├── summary.json                    # run rollup: pass rate, counts, cost
            └── fizzbuzz--a1b2c3d4/             # <result_dir> for one test case
                ├── summary.json                # per-test rollup across runs
                ├── test/                       # generated test bundle: frozen inputs for reproducibility
                │   ├── EVAL.yaml               # resolved eval spec
                │   ├── targets.yaml            # resolved target config
                │   └── graders/                # grader files used
                └── run-1/                      # one attempt (run-N for repeats/trials)
                    ├── result.json             # compact attempt manifest
                    ├── grading.json            # per-assertion grading detail
                    ├── metrics.json            # tool calls, transcript stats, behavior metrics
                    ├── timing.json             # duration, token usage, cost
                    ├── transcript.jsonl        # parsed agent transcript
                    ├── transcript-raw.jsonl    # raw agent output (debugging)
                    └── outputs/                # captured stdout and grader outputs
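Because `index.jsonl` sits at a predictable path, a script can locate and read it without going through the CLI. The sketch below is one possible way to do that, not part of AgentV. The helper names (`indexPath`, `parseJsonl`, `passRate`, `loadIndex`) are hypothetical, and it assumes the resolved target directory always has the form `<target>--<model>`, as in the example layout. It makes no assumption about the fields inside each record: the pass criterion is supplied by the caller.

```typescript
import { readFileSync } from 'node:fs';
import { posix } from 'node:path';

type IndexRecord = Record<string, unknown>;

// Hypothetical helper: builds the manifest path from the layout shown above.
// Assumes the resolved target directory is always `<target>--<model>`.
export function indexPath(
  experiment: string,
  timestamp: string,
  target: string,
  model: string,
  resultsRoot = '.agentv/results',
): string {
  return posix.join(resultsRoot, experiment, timestamp, `${target}--${model}`, 'index.jsonl');
}

// Parses JSONL text: one JSON object per non-empty line.
export function parseJsonl(text: string): IndexRecord[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as IndexRecord);
}

// Fraction of records accepted by the caller-supplied predicate (0 for an empty list).
export function passRate(records: IndexRecord[], isPass: (r: IndexRecord) => boolean): number {
  if (records.length === 0) return 0;
  return records.filter(isPass).length / records.length;
}

// Reads a manifest from disk and returns its parsed records.
export function loadIndex(file: string): IndexRecord[] {
  return parseJsonl(readFileSync(file, 'utf8'));
}
```

A CI script could then call something like `passRate(loadIndex(path), (r) => typeof r.score === 'number' && r.score >= 0.8)`. The `score` field name here is an assumption, so check the records your runs actually write before relying on it.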
Use evaluate() when your application owns the run:
import { evaluate } from '@agentv/sdk';

const { results, summary } = await evaluate({
  experiment: 'backend-with-skills',
  task: async (input) => runMyAppTarget(input),
  threshold: 0.8,
  tests: [
    {
      id: 'fizzbuzz',
      input: 'Write FizzBuzz in Python',
      assertions: [
        { type: 'contains', value: 'fizz' },
        'Implements correct FizzBuzz logic for multiples of 3, 5, and 15',
        { type: 'code-grader', command: ['python3', './validators/check_syntax.py'] },
        { type: 'llm-grader', prompt: './graders/correctness.md' },
      ],
    },
  ],
});

console.log(`${summary.passed}/${summary.total} passed`);
Use defineEval() when you want AgentV to run the TypeScript eval file:
import { defineEval } from '@agentv/sdk';

export default defineEval({
  experiment: 'backend-with-skills',
  description: 'Code generation quality',
  target: 'copilot-sdk',
  model: 'claude-sonnet-4.6',
  policy: {
    runs: 3,
    earlyExit: false,
    timeoutSeconds: 600,
    threshold: 0.8,
    budgetUsd: 5,
  },
  workspace: {
    isolation: 'per_case',
  },
  tests: [
    {
      id: 'fizzbuzz',
      input: 'Write FizzBuzz in Python',
      assertions: [
        { type: 'contains', value: 'fizz' },
        'Implements correct FizzBuzz logic for multiples of 3, 5, and 15',
        { type: 'code-grader', command: ['python3', './validators/check_syntax.py'] },
        { type: 'llm-grader', prompt: './graders/correctness.md' },
      ],
    },
  ],
});
Full docs at agentv.dev/docs.
- Eval files — format and structure
- Custom graders — code graders in any language
- Rubrics — structured criteria scoring
- Targets — configure agents and providers
- Compare results — A/B testing and regression detection
- Ecosystem — how AgentV fits with Agent Control and Langfuse
git clone https://github.com/EntityProcess/agentv.git
cd agentv
bun install && bun run build
bun test
See AGENTS.md for development guidelines.
To simulate a one-command production deployment of AgentV Dashboard with the AgentV examples project and a remote results repository:
AGENTV_RESULTS_REPO=EntityProcess/agentv-evalresults \
scripts/setup-dashboard-deployment.sh
The script clones AgentV examples into ~/agentv-dashboard, clones the results
repo, writes the Dashboard project registry under the $AGENTV_HOME config
pair, builds the Docker image, and starts Dashboard at http://localhost:3117.
MIT