AgentV

Test AI targets on real repo tasks and measure what actually works.

Why?

Local-first — runs on your machine, no cloud accounts or API keys for eval infrastructure
Repo-backed workspaces — reuse real repos, setup scripts, and existing harnesses instead of rebuilding synthetic tasks
Portable artifacts — results, traces, and reports are saved in a durable format other tools can consume
Version-controlled — evals, judges, and results all live in Git
Hybrid graders — deterministic code checks + LLM-based subjective scoring
CI/CD native — exit codes, JSONL output, threshold flags for pipeline gating
Any target — run against agents, model providers, gateways, replay targets, CLI wrappers, transcript providers, and future app or service wrappers

Core Concepts

Eval suite / imports / tests are the task corpus: the prompts, cases, datasets, and imported benchmarks you want to evaluate.
Category is derived from where the eval lives, such as folder path and file name. Use paths to organize the corpus instead of repeating category labels in every eval.
Workspace / fixtures / graders are task-owned context: repos, setup scripts, files, fixtures, isolation, deterministic checks, and LLM grading prompts.
Target is the system under test: an agent, provider, gateway, replay target, CLI wrapper, transcript provider, or future app/service wrapper. Use model when you need to override the target's default model for a run.
Experiment is the named condition being measured over that corpus, such as backend-with-skills or backend-without-skills.
Policy controls how AgentV executes and gates the eval: runs, thresholds, timeouts, and budgets. It is not the experiment identity.
Run is one concrete execution of an experiment against a target/model that writes portable artifacts for readers such as Dashboard, compare, and trend.
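As a rough illustration of path-derived categories, the sketch below turns an eval file path into a category label. The helper name deriveCategory, the default evals root, and the rule itself (folder path plus the file name without its extension) are assumptions made for illustration; AgentV's actual derivation may differ.

```typescript
// Hypothetical helper, not part of AgentV: the real derivation rule may differ.
// Derives a category label from an eval file's path relative to an evals root.
function deriveCategory(evalPath: string, evalsRoot = "evals"): string {
  // Normalize Windows separators and drop empty or "." segments.
  const segments = evalPath
    .replace(/\\/g, "/")
    .split("/")
    .filter((segment) => segment !== "" && segment !== ".");

  // Keep only the part of the path that follows the evals root, if present.
  const rootIndex = segments.indexOf(evalsRoot);
  const relative = rootIndex >= 0 ? segments.slice(rootIndex + 1) : segments;
  if (relative.length === 0) return "";

  // Strip everything after the first dot in the file name,
  // e.g. "fizzbuzz.eval.yaml" becomes "fizzbuzz".
  const fileName = relative[relative.length - 1] ?? "";
  const baseName = fileName.split(".")[0] ?? "";

  return [...relative.slice(0, -1), baseName].join("/");
}

console.log(deriveCategory("evals/backend/codegen/fizzbuzz.yaml"));
```

Under this assumed rule, moving an eval file to a different folder changes its category without touching the file's contents.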

flowchart LR
  corpus["Eval suite / imports / tests<br/>task corpus"]
  category["Category<br/>path-derived grouping"]
  context["Workspace / fixtures / graders<br/>task-owned context"]
  experiment["Experiment<br/>named run condition"]
  target["Target + model<br/>system under test"]
  policy["Policy<br/>execution + gates"]
  run["Run<br/>concrete execution"]
  artifacts["Run artifacts<br/>summary.json + index.jsonl + sidecars"]
  readers["Dashboard / compare / trend<br/>derived readers"]

  corpus --> category
  corpus --> run
  context --> run
  category --> run
  experiment --> run
  target --> run
  policy --> run
  run --> artifacts
  artifacts --> readers

Quick start

1. Install and initialize:

npm install -g agentv
agentv init

2. Configure targets in .agentv/targets.yaml — point to the system under test, such as an agent, provider, gateway, replay source, or CLI wrapper.

3. Create an eval in evals/, for example evals/my-eval.yaml:

experiment: backend-with-skills
description: Code generation quality
target: copilot-sdk
model: claude-sonnet-4.6

workspace:
  isolation: per_case

policy:
  runs: 3
  early_exit: false
  timeout_seconds: 600
  threshold: 0.8
  budget_usd: 5

tests:
  - id: fizzbuzz
    input: Write FizzBuzz in Python
    assertions:
      - type: contains
        value: "fizz"
      - Implements correct FizzBuzz logic for multiples of 3, 5, and 15
      - type: code-grader
        command: ["python3", "./validators/check_syntax.py"]
      - type: llm-grader
        prompt: ./graders/correctness.md

4. Run it:

agentv eval evals/my-eval.yaml

5. Compare two runs (pass two index.jsonl manifests — e.g. before and after a change):

agentv compare \
  .agentv/results/backend-without-skills/<timestamp>/copilot-sdk--claude-sonnet-4.6/index.jsonl \
  .agentv/results/backend-with-skills/<timestamp>/copilot-sdk--claude-sonnet-4.6/index.jsonl
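To make the idea of comparing two manifests concrete, here is a minimal sketch of a per-test score diff over two JSONL strings. It is not how agentv compare is implemented. The field names test_id and score are assumptions, so inspect a real index.jsonl before adapting it.

```typescript
// Assumed shape of one index.jsonl line; the real manifest fields may differ.
interface IndexEntry {
  test_id: string; // hypothetical field name
  score: number; // hypothetical field name, assumed to be in [0, 1]
}

interface ScoreDelta {
  testId: string;
  before: number;
  after: number;
  delta: number;
}

// Parses JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): IndexEntry[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as IndexEntry);
}

// Joins two runs on test id and reports the score change for shared tests.
function diffRuns(beforeText: string, afterText: string): ScoreDelta[] {
  const before = new Map(
    parseJsonl(beforeText).map((entry) => [entry.test_id, entry.score] as const),
  );
  const deltas: ScoreDelta[] = [];
  for (const entry of parseJsonl(afterText)) {
    const prior = before.get(entry.test_id);
    if (prior === undefined) continue; // skip tests missing from the first run
    deltas.push({
      testId: entry.test_id,
      before: prior,
      after: entry.score,
      delta: entry.score - prior,
    });
  }
  return deltas;
}
```

A positive delta means the test scored higher in the second manifest; tests that appear in only one run are ignored in this sketch.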

Results

Each run writes a timestamped invocation directory under .agentv/results/<experiment>/<timestamp>/. In this example, experiment: backend-with-skills names the condition being measured, target: copilot-sdk selects the system under test, and model: claude-sonnet-4.6 overrides that target's default model. The resolved target identity combines both as copilot-sdk--claude-sonnet-4.6, so CI baselines can distinguish model changes. The flat index.jsonl manifest is the portable surface used by scripts, CI, and agentv compare:

agentv eval evals/my-eval.yaml
cat .agentv/results/backend-with-skills/<timestamp>/copilot-sdk--claude-sonnet-4.6/index.jsonl
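If a script needs its own gate on top of the manifest, one possible shape is sketched below: average the per-test scores and map the result to an exit code. The score field name and the choice of mean score as the gating metric are assumptions, not AgentV's documented behavior.

```typescript
// Assumption: each index.jsonl line is a JSON object with a numeric `score` in [0, 1].
// The field name is hypothetical; check a real manifest before relying on it.
function gateOnThreshold(
  jsonlText: string,
  threshold: number,
): { meanScore: number; exitCode: number } {
  const scores = jsonlText
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => (JSON.parse(line) as { score: number }).score);

  // Treat an empty manifest as a failure rather than a vacuous pass.
  if (scores.length === 0) return { meanScore: 0, exitCode: 1 };

  const meanScore = scores.reduce((sum, score) => sum + score, 0) / scores.length;
  // Exit code 0 signals success to CI; any non-zero value fails the step.
  return { meanScore, exitCode: meanScore >= threshold ? 0 : 1 };
}
```

A CI script would read the manifest file, call gateOnThreshold with its contents, and pass the returned exitCode to the process exit call of its runtime.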

Run bundle layout:

.agentv/results/
└── backend-with-skills/              # <experiment> — comparison/run grouping
    └── 2026-06-30T08-30-00-000Z/     # <timestamp> — one run
        └── copilot-sdk--claude-sonnet-4.6/ # <target> — resolved system under test
            ├── index.jsonl           # flat per-test results (scripts/CI, `agentv compare`)
            ├── summary.json          # run rollup: pass rate, counts, cost
            └── fizzbuzz--a1b2c3d4/   # <result_dir> for one test case
                ├── summary.json      # per-test rollup across runs
                ├── test/             # generated test bundle: frozen inputs for reproducibility
                │   ├── EVAL.yaml     #   resolved eval spec
                │   ├── targets.yaml  #   resolved target config
                │   └── graders/      #   grader files used
                └── run-1/            # one attempt (run-N for repeats/trials)
                    ├── result.json   # compact attempt manifest
                    ├── grading.json  # per-assertion grading detail
                    ├── metrics.json  # tool calls, transcript stats, behavior metrics
                    ├── timing.json   # duration, token usage, cost
                    ├── transcript.jsonl       # parsed agent transcript
                    ├── transcript-raw.jsonl   # raw agent output (debugging)
                    └── outputs/      # captured stdout and grader outputs
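The layout above can be turned into a small path helper for scripts that need to locate a manifest. The function names are hypothetical, and deriving the timestamp directory from an ISO string is an assumption based only on the example directory name shown in the layout.

```typescript
// Hypothetical helpers that mirror the directory layout shown above.

// Assumption: the timestamp directory is an ISO-8601 UTC string with ":" and "."
// replaced by "-", which matches the example "2026-06-30T08-30-00-000Z".
function toTimestampDir(date: Date): string {
  return date.toISOString().replace(/[:.]/g, "-");
}

// Builds .agentv/results/<experiment>/<timestamp>/<target>--<model>/index.jsonl.
function indexManifestPath(
  experiment: string,
  timestampDir: string,
  target: string,
  model: string,
): string {
  const resolvedTarget = `${target}--${model}`;
  return [".agentv", "results", experiment, timestampDir, resolvedTarget, "index.jsonl"].join("/");
}

const timestampDir = toTimestampDir(new Date("2026-06-30T08:30:00.000Z"));
console.log(
  indexManifestPath("backend-with-skills", timestampDir, "copilot-sdk", "claude-sonnet-4.6"),
);
```

The sketch only covers the case from the example, where a model override is set; how the resolved target is named without an override is not stated here.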

TypeScript SDK

Use evaluate() when your application owns the run:

import { evaluate } from '@agentv/sdk';

const { results, summary } = await evaluate({
  experiment: 'backend-with-skills',
  // runMyAppTarget is your own function that invokes the system under test.
  task: async (input) => runMyAppTarget(input),
  threshold: 0.8,
  tests: [
    {
      id: 'fizzbuzz',
      input: 'Write FizzBuzz in Python',
      assertions: [
        { type: 'contains', value: 'fizz' },
        'Implements correct FizzBuzz logic for multiples of 3, 5, and 15',
        { type: 'code-grader', command: ['python3', './validators/check_syntax.py'] },
        { type: 'llm-grader', prompt: './graders/correctness.md' },
      ],
    },
  ],
});

console.log(`${summary.passed}/${summary.total} passed`);
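The evaluate() example calls runMyAppTarget without defining it. A stand-in like the one below can help when wiring things up before a real target exists. It assumes the task receives the test input as a string and resolves to the output text; that signature is a guess, so check the SDK types before relying on it.

```typescript
// Hypothetical stand-in for runMyAppTarget. Assumes the task takes the test
// input as a string and resolves to the target's output text.
async function runMyAppTarget(input: string): Promise<string> {
  // Return a canned Python program for FizzBuzz prompts.
  if (/fizzbuzz/i.test(input)) {
    return [
      "for i in range(1, 101):",
      "    if i % 15 == 0:",
      '        print("fizzbuzz")',
      "    elif i % 3 == 0:",
      '        print("fizz")',
      "    elif i % 5 == 0:",
      '        print("buzz")',
      "    else:",
      "        print(i)",
    ].join("\n");
  }
  // Anything else gets a placeholder so unexpected prompts are easy to spot.
  return `No canned answer for: ${input}`;
}
```

In a real setup, the body would call your agent, HTTP service, or CLI wrapper and return its response.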

Use defineEval() when you want AgentV to run the TypeScript eval file:

import { defineEval } from '@agentv/sdk';

export default defineEval({
  experiment: 'backend-with-skills',
  description: 'Code generation quality',
  target: 'copilot-sdk',
  model: 'claude-sonnet-4.6',
  policy: {
    runs: 3,
    earlyExit: false,
    timeoutSeconds: 600,
    threshold: 0.8,
    budgetUsd: 5,
  },
  workspace: {
    isolation: 'per_case',
  },
  tests: [
    {
      id: 'fizzbuzz',
      input: 'Write FizzBuzz in Python',
      assertions: [
        { type: 'contains', value: 'fizz' },
        'Implements correct FizzBuzz logic for multiples of 3, 5, and 15',
        { type: 'code-grader', command: ['python3', './validators/check_syntax.py'] },
        { type: 'llm-grader', prompt: './graders/correctness.md' },
      ],
    },
  ],
});

Documentation

Full docs at agentv.dev/docs.

Eval files — format and structure
Custom graders — code graders in any language
Rubrics — structured criteria scoring
Targets — configure agents and providers
Compare results — A/B testing and regression detection
Ecosystem — how AgentV fits with Agent Control and Langfuse

Development

git clone https://github.com/EntityProcess/agentv.git
cd agentv
bun install && bun run build
bun test

See AGENTS.md for development guidelines.

Docker Dashboard Deployment

To simulate a one-command production deployment of AgentV Dashboard with the AgentV examples project and a remote results repository:

AGENTV_RESULTS_REPO=EntityProcess/agentv-evalresults \
  scripts/setup-dashboard-deployment.sh

The script clones AgentV examples into ~/agentv-dashboard, clones the results repo, writes the Dashboard project registry under the $AGENTV_HOME config pair, builds the Docker image, and starts Dashboard at http://localhost:3117.

License

MIT
