Merged
4 changes: 2 additions & 2 deletions CONCEPTS.md
Original file line number Diff line number Diff line change
@@ -24,11 +24,11 @@ Shared domain vocabulary for this project — entities, named processes, and sta

**Workspace** — The task environment an eval prepares for the agent: repositories, templates, fixture files, and lifecycle hooks. It is not prompt input; use `input` for instructions and `workspace.repos[]` for multi-repo workspaces the agent can inspect or modify through tools.

**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.
**Run manifest** — The root `run_manifest.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.

**Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.

**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
**Result directory** — The `result_dir` field in a `run_manifest.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `run_manifest.jsonl` and must not infer it from suite names, display names, test IDs, or targets.

**Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.

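Reviewer sketch (not part of this PR): the CONCEPTS.md entries above say consumers must discover `result_dir` from manifest rows rather than infer it from names. A minimal reading of that contract might look like the following. The `ManifestRow` type and both helper names are hypothetical, and treating `result_dir` as relative to the run directory is an assumption based on the phrase "run-local".

```typescript
import { join } from 'node:path';

// Hypothetical row shape: only fields named in CONCEPTS.md are listed here.
interface ManifestRow {
  eval_path?: string;
  test_id?: string;
  target?: string;
  result_dir?: string;
  summary_path?: string;
  grading_path?: string;
}

// Parse JSONL text: one JSON object per non-empty line.
export function parseManifest(jsonl: string): ManifestRow[] {
  return jsonl
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as ManifestRow);
}

// Collect result directories from the rows themselves.
// Assumption: result_dir is a path relative to the run directory.
export function resultDirs(runDir: string, rows: readonly ManifestRow[]): string[] {
  const dirs: string[] = [];
  for (const row of rows) {
    if (typeof row.result_dir === 'string') {
      dirs.push(join(runDir, row.result_dir));
    }
  }
  return dirs;
}
```

Rows without a `result_dir` are skipped rather than guessed at, which matches the "must not infer" wording.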
6 changes: 3 additions & 3 deletions README.md
@@ -73,14 +73,14 @@ agentv eval evals/my-eval.yaml

**5. Compare results across targets:**
```bash
agentv compare .agentv/results/default/<timestamp>/index.jsonl
agentv compare .agentv/results/default/<timestamp>/run_manifest.jsonl
```

## Output formats

```bash
agentv eval evals/my-eval.yaml --output ./run # writes ./run/index.jsonl
cat ./run/index.jsonl # JSONL results for scripts/CI
agentv eval evals/my-eval.yaml --output ./run # writes ./run/run_manifest.jsonl
cat ./run/run_manifest.jsonl # JSONL results for scripts/CI
```

## TypeScript SDK
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -15,7 +15,7 @@ This roadmap translates [STRATEGY.md](STRATEGY.md) into the next few product pha

## Phase 1: Finish the artifact and local inspection foundation

- Keep the canonical handoff surface centered on completed run bundles, `index.jsonl`, grading/timing/metrics artifacts, normalized transcripts, and optional `external_trace` link metadata.
- Keep the canonical handoff surface centered on completed run bundles, `run_manifest.jsonl`, grading/timing/metrics artifacts, normalized transcripts, and optional `external_trace` link metadata.
- Finish the vendor-neutral local export seams that let completed runs be re-read, compared, exported, and attached to non-Phoenix adapters without vendor-specific logic in core.
- Keep OTLP/OpenInference mapping generic and reusable before building backend-specific upload or import paths.

2 changes: 1 addition & 1 deletion STRATEGY.md
@@ -21,7 +21,7 @@ AgentV stays repo-native and workspace-native: it runs or imports evaluations ar

- **Repo-native eval success** - Share of dogfood and example eval flows that run against real workspaces, hooks, repo materialization, or imported artifacts without extra infrastructure; measured by CI and manual UAT on canonical suites.
- **Time to inspect a run** - Time from completed `agentv eval` to usable local review, compare, or report output from the canonical run bundle; measured through CLI and Dashboard/report workflows.
- **Artifact portability coverage** - Share of integrations and follow-on workflows that consume `index.jsonl`, `summary.json`, trace sidecars, or imported run bundles instead of bespoke stores; measured by adapter smoke tests, docs, and example coverage.
- **Artifact portability coverage** - Share of integrations and follow-on workflows that consume `run_manifest.jsonl`, `summary.json`, trace sidecars, or imported run bundles instead of bespoke stores; measured by adapter smoke tests, docs, and example coverage.
- **Git-backed results reliability** - Success rate for publish, sync, resume, and WIP checkpoint flows across local branches and dedicated results repos; measured by integration tests and manual end-to-end verification.

## Tracks
2 changes: 1 addition & 1 deletion apps/cli/src/cli.ts
@@ -6,7 +6,7 @@ import { runCli } from './index.js';
// Forward SIGINT/SIGTERM to spawned provider subprocesses before exiting.
// Without this, Dashboard's `child.kill('SIGTERM')` against the CLI orphans
// any in-flight `claude`/`codex`/`pi`/`copilot` subprocess. The partial
// `index.jsonl` is already row-by-row durable, so finished tests survive.
// `run_manifest.jsonl` is already row-by-row durable, so finished tests survive.
//
// First signal: kill children, exit with the conventional 128+signal code.
// Second signal within the same process: hard-exit so a hung child cannot
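Reviewer sketch (not the code in `cli.ts`): the comment above describes a two-stage signal policy, where the first signal is forwarded to children and the process exits with 128 plus the signal number, and a repeated signal hard-exits. One way to express that as a testable handler is shown below. `KillableChild`, `signalExitCode`, and `createSignalForwarder` are hypothetical names, and the injected `exit` callback stands in for `process.exit`.

```typescript
import { constants } from 'node:os';

type ForwardedSignal = 'SIGINT' | 'SIGTERM';

// Hypothetical minimal view of a spawned provider subprocess.
interface KillableChild {
  kill(signal: ForwardedSignal): boolean;
}

// Conventional shell exit code for termination by signal: 128 + signal number.
export function signalExitCode(signal: ForwardedSignal): number {
  return 128 + constants.signals[signal];
}

export function createSignalForwarder(
  children: Iterable<KillableChild>,
  exit: (code: number) => void,
): (signal: ForwardedSignal) => void {
  let alreadySignalled = false;
  return (signal) => {
    const code = signalExitCode(signal);
    if (alreadySignalled) {
      // Second signal: skip child cleanup and exit immediately.
      exit(code);
      return;
    }
    alreadySignalled = true;
    for (const child of children) {
      try {
        child.kill(signal);
      } catch {
        // The child may already have exited; nothing left to forward.
      }
    }
    exit(code);
  };
}
```

In a real CLI the returned handler would be registered with `process.on('SIGINT', ...)` and `process.on('SIGTERM', ...)`; the second-signal branch only matters when the first exit path is delayed.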
6 changes: 3 additions & 3 deletions apps/cli/src/commands/compare/index.ts
@@ -472,7 +472,7 @@ export const compareCommand = command({
type: string,
displayName: 'results',
description:
'Run workspace or index.jsonl manifest path(s). One source: single-run mode. Two sources: pairwise mode.',
'Run workspace or run manifest path(s). One source: single-run mode. Two sources: pairwise mode.',
}),
threshold: option({
type: optional(number),
@@ -514,7 +514,7 @@ export const compareCommand = command({

try {
if (results.length === 0) {
throw new Error('At least one run workspace or index.jsonl manifest is required');
throw new Error('At least one run workspace or run manifest is required');
}

if (results.length === 2) {
@@ -602,7 +602,7 @@ export const compareCommand = command({
process.exit(exitCode);
}
} else {
throw new Error('Expected 1 or 2 run workspaces or index.jsonl manifests');
throw new Error('Expected 1 or 2 run workspaces or run manifests');
}
} catch (error) {
console.error(`Error: ${(error as Error).message}`);
2 changes: 1 addition & 1 deletion apps/cli/src/commands/eval/commands/aggregate.ts
@@ -11,7 +11,7 @@ export const evalAggregateCommand = command({
runDir: positional({
type: string,
displayName: 'run-dir',
description: 'Path to a run directory containing index.jsonl',
description: 'Path to a run directory containing a run manifest',
}),
},
handler: async (args) => {
6 changes: 3 additions & 3 deletions apps/cli/src/commands/eval/commands/run.ts
@@ -52,12 +52,12 @@ export const evalRunCommand = command({
long: 'output',
short: 'o',
description:
'Run artifact directory (writes index.jsonl, summary.json, and per-case artifacts)',
'Run artifact directory (writes run_manifest.jsonl, summary.json, and per-case artifacts)',
}),
outputFormat: option({
type: optional(string),
long: 'output-format',
description: '[Removed] Run directories always write index.jsonl',
description: '[Removed] Run directories always write run_manifest.jsonl',
}),
experiment: option({
type: optional(string),
@@ -161,7 +161,7 @@ export const evalRunCommand = command({
type: optional(string),
long: 'retry-errors',
description:
'Path to a previous run workspace or index.jsonl manifest — re-run only execution_error test cases',
'Path to a previous run workspace or run manifest — re-run only execution_error test cases',
}),
resume: flag({
long: 'resume',
9 changes: 5 additions & 4 deletions apps/cli/src/commands/eval/interactive.ts
@@ -11,6 +11,7 @@ import {
getCategories,
} from './discover.js';
import { type LastConfig, loadLastConfig, saveLastConfig } from './last-config.js';
import { resolveExistingRunPrimaryPath } from './result-layout.js';
import { runEvalCommand } from './run-eval.js';
import { findRepoRoot } from './shared.js';

@@ -89,10 +90,10 @@ async function promptMainMenu(
type MenuChoice = 'new' | 'rerun' | 'resume' | 'exit';
const choices: Array<{ name: string; value: MenuChoice; description?: string }> = [];

// Resume entry: only when the prior run has a known artifact dir with an index.jsonl
// Resume entry: only when the prior run has a known artifact dir with a manifest.
if (lastConfig?.outputDir) {
const indexPath = path.join(lastConfig.outputDir, 'index.jsonl');
if (existsSync(indexPath)) {
const indexPath = resolveExistingRunPrimaryPath(lastConfig.outputDir);
if (indexPath && existsSync(indexPath)) {
const dirLabel = path.basename(lastConfig.outputDir);
choices.push({
name: '⏯ Resume last run',
@@ -349,7 +350,7 @@ async function executeConfig(

// Persist config with the resolved artifact dir so the wizard can offer
// "Resume last run" on the next invocation. Done after a successful run so
// the saved outputDir always points at a real index.jsonl.
// the saved outputDir always points at a real run manifest.
if (result) {
await saveLastConfig({
timestamp: new Date().toISOString(),
64 changes: 55 additions & 9 deletions apps/cli/src/commands/eval/result-layout.ts
@@ -1,7 +1,16 @@
import { type Dirent, existsSync, readdirSync, statSync } from 'node:fs';
import { type Dirent, existsSync, readFileSync, readdirSync, statSync } from 'node:fs';
import path from 'node:path';

export const RESULT_INDEX_FILENAME = 'index.jsonl';
export const RESULT_MANIFEST_FILENAME = 'run_manifest.jsonl';
export const LEGACY_RESULT_INDEX_FILENAME = 'index.jsonl';
// Backward-compatible export name retained for existing callers. New writes use
// the row-level run manifest filename.
export const RESULT_INDEX_FILENAME = RESULT_MANIFEST_FILENAME;
export const RESULT_MANIFEST_FILENAMES = [
RESULT_MANIFEST_FILENAME,
LEGACY_RESULT_INDEX_FILENAME,
] as const;
export const RUN_SUMMARY_FILENAME = 'summary.json';
export const RESULTS_DIRNAME = 'results';
export const DEFAULT_EXPERIMENT_NAME = 'default';
export const RESERVED_RESULTS_NAMESPACES = new Set(['export', 'metadata', 'runs']);
@@ -64,13 +73,48 @@ export function resolveRunIndexPath(runDir: string): string {
}

export function isRunManifestPath(filePath: string): boolean {
return path.basename(filePath) === RESULT_INDEX_FILENAME;
return RESULT_MANIFEST_FILENAMES.includes(
path.basename(filePath) as (typeof RESULT_MANIFEST_FILENAMES)[number],
);
}

function safeSummaryManifestPath(runDir: string, manifestPath: unknown): string | undefined {
if (typeof manifestPath !== 'string' || manifestPath.trim().length === 0) {
return undefined;
}
if (path.isAbsolute(manifestPath)) {
return undefined;
}
const normalized = path.normalize(manifestPath);
if (normalized.startsWith('..') || path.isAbsolute(normalized)) {
return undefined;
}
return path.join(runDir, normalized);
}

function resolveSummaryManifestPath(runDir: string): string | undefined {
try {
const summary = JSON.parse(readFileSync(path.join(runDir, RUN_SUMMARY_FILENAME), 'utf8')) as {
manifest_path?: unknown;
};
const manifestPath = safeSummaryManifestPath(runDir, summary.manifest_path);
return manifestPath && existsSync(manifestPath) ? manifestPath : undefined;
} catch {
return undefined;
}
}

export function resolveExistingRunPrimaryPath(runDir: string): string | undefined {
const indexPath = resolveRunIndexPath(runDir);
if (existsSync(indexPath)) {
return indexPath;
const summaryManifestPath = resolveSummaryManifestPath(runDir);
if (summaryManifestPath) {
return summaryManifestPath;
}

for (const filename of RESULT_MANIFEST_FILENAMES) {
const manifestPath = path.join(runDir, filename);
if (existsSync(manifestPath)) {
return manifestPath;
}
}

return undefined;
@@ -131,10 +175,12 @@ export function resolveWorkspaceOrFilePath(filePath: string): string {
}
if (nested.length > 1) {
throw new Error(
`Result workspace contains multiple ${RESULT_INDEX_FILENAME} manifests; pass one bundle directory or manifest: ${filePath}`,
`Result workspace contains multiple run manifests; pass one bundle directory or manifest: ${filePath}`,
);
}
throw new Error(`Result workspace is missing ${RESULT_INDEX_FILENAME}: ${filePath}`);
throw new Error(
`Result workspace is missing ${RESULT_MANIFEST_FILENAME} or legacy ${LEGACY_RESULT_INDEX_FILENAME}: ${filePath}`,
);
}

export function resolveRunManifestPath(filePath: string): string {
@@ -144,7 +190,7 @@ export function resolveRunManifestPath(filePath: string): string {

if (!isRunManifestPath(filePath)) {
throw new Error(
`Expected a run workspace directory or ${RESULT_INDEX_FILENAME} manifest: ${filePath}`,
`Expected a run workspace directory or ${RESULT_MANIFEST_FILENAME} manifest (legacy ${LEGACY_RESULT_INDEX_FILENAME} is also readable): ${filePath}`,
);
}

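Reviewer sketch (a standalone condensation, not the module itself): the `result-layout.ts` hunks above appear to establish a three-step lookup order, namely a safe relative `manifest_path` from `summary.json`, then `run_manifest.jsonl`, then legacy `index.jsonl`. The snippet below replays that order against a temporary directory so the precedence can be checked by hand. `safeRelative` and `resolveManifest` are hypothetical stand-ins for the PR's functions, and the demo paths are made up.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { isAbsolute, join, normalize } from 'node:path';

// New filename first, legacy filename second, mirroring the PR's ordering.
const MANIFEST_FILENAMES = ['run_manifest.jsonl', 'index.jsonl'] as const;

// Accept only non-empty relative paths that do not climb out of runDir.
function safeRelative(runDir: string, candidate: unknown): string | undefined {
  if (typeof candidate !== 'string' || candidate.trim().length === 0) {
    return undefined;
  }
  const normalized = normalize(candidate);
  if (isAbsolute(normalized) || normalized.startsWith('..')) {
    return undefined;
  }
  return join(runDir, normalized);
}

export function resolveManifest(runDir: string): string | undefined {
  try {
    const summary = JSON.parse(readFileSync(join(runDir, 'summary.json'), 'utf8')) as {
      manifest_path?: unknown;
    };
    const fromSummary = safeRelative(runDir, summary.manifest_path);
    if (fromSummary && existsSync(fromSummary)) {
      return fromSummary;
    }
  } catch {
    // Missing or malformed summary.json: fall through to filename probing.
  }
  for (const filename of MANIFEST_FILENAMES) {
    const manifestPath = join(runDir, filename);
    if (existsSync(manifestPath)) {
      return manifestPath;
    }
  }
  return undefined;
}

// Demo against a throwaway directory.
const demoRunDir = mkdtempSync(join(tmpdir(), 'run-bundle-'));
writeFileSync(join(demoRunDir, 'index.jsonl'), '');
const legacyOnly = resolveManifest(demoRunDir); // only the legacy file exists
writeFileSync(join(demoRunDir, 'run_manifest.jsonl'), '');
const bothPresent = resolveManifest(demoRunDir); // new filename takes precedence
writeFileSync(join(demoRunDir, 'summary.json'), JSON.stringify({ manifest_path: '../escape.jsonl' }));
const unsafeSummary = resolveManifest(demoRunDir); // traversal is ignored, probing still applies
```

One point worth confirming in review: a `manifest_path` that escapes the run directory is silently ignored rather than reported, so a corrupted `summary.json` falls back to filename probing without a warning.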
4 changes: 2 additions & 2 deletions apps/cli/src/commands/eval/run-cache.ts
@@ -5,6 +5,7 @@ import path from 'node:path';
import {
RESULT_INDEX_FILENAME,
discoverRunManifestPaths,
isRunManifestPath,
resolveExistingRunPrimaryPath,
resolveRunIndexPath,
} from './result-layout.js';
@@ -67,8 +68,7 @@ export async function resolveCachedRunDir(cwd: string): Promise<string | undefin

export async function saveRunCache(cwd: string, resultPath: string): Promise<void> {
const dir = path.join(cwd, '.agentv');
const lastRunDir =
path.basename(resultPath) === RESULT_INDEX_FILENAME ? path.dirname(resultPath) : resultPath;
const lastRunDir = isRunManifestPath(resultPath) ? path.dirname(resultPath) : resultPath;
await mkdir(dir, { recursive: true });
const cache: RunCache = {
lastRunDir,
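Reviewer sketch: the `saveRunCache` change above matters because `RESULT_INDEX_FILENAME` now aliases `run_manifest.jsonl`, so the old basename comparison would no longer recognize a legacy `index.jsonl` path and would cache the file path instead of its directory. The snippet below illustrates the intended normalization with a hypothetical `toRunDir` helper and made-up paths.

```typescript
import { basename, dirname } from 'node:path';

// Both accepted manifest filenames, as listed in the PR.
const MANIFEST_FILENAMES: readonly string[] = ['run_manifest.jsonl', 'index.jsonl'];

// Map a manifest file path to its run directory; leave directory paths alone.
export function toRunDir(resultPath: string): string {
  return MANIFEST_FILENAMES.includes(basename(resultPath)) ? dirname(resultPath) : resultPath;
}

const fromNewManifest = toRunDir('runs/a/run_manifest.jsonl');
const fromLegacyManifest = toRunDir('runs/a/index.jsonl');
const fromDirectory = toRunDir('runs/a');
```

All three calls are expected to yield the same run directory, which is the property the cache relies on.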
21 changes: 11 additions & 10 deletions apps/cli/src/commands/eval/run-eval.ts
@@ -64,6 +64,7 @@ import { resolveOtelBackend } from './otel-backends.js';
import { type OutputWriter, createOutputWriter } from './output-writer.js';
import { ProgressDisplay, type Verdict, type WorkerProgress } from './progress-display.js';
import {
RESULT_INDEX_FILENAME,
buildDefaultRunDirFromName,
createRunDirName,
discoverRunManifestPaths,
@@ -139,7 +140,7 @@ interface NormalizedOptions {
readonly keepWorkspaces: boolean;
/** Removed: use --output instead */
readonly artifacts?: string;
/** Removed: the run directory always uses index.jsonl */
/** Removed: the run directory always uses run_manifest.jsonl */
readonly outputFormat?: string;
readonly graderTarget?: string;
readonly model?: string;
@@ -288,7 +289,7 @@ function outputFileMigrationMessage(value: string): string {
ext === '.xml'
? 'JUnit XML export from agentv eval has been removed.'
: 'Flat result file export from agentv eval has been removed.';
return `--output expects a run directory, not a file path: ${value}\n${removalHint} Set --output <dir> for the canonical run artifacts; AgentV always writes <dir>/index.jsonl.`;
return `--output expects a run directory, not a file path: ${value}\n${removalHint} Set --output <dir> for the canonical run artifacts; AgentV always writes <dir>/${RESULT_INDEX_FILENAME}.`;
}

function artifactsMigrationMessage(artifactsDir: string, outputDir?: string): string {
@@ -1076,7 +1077,7 @@ class BundleOutputWriter implements OutputWriter {
}
const dir = resultBundleDir(this.invocationDir, result);
mkdirSync(dir, { recursive: true });
const indexPath = path.join(dir, 'index.jsonl');
const indexPath = path.join(dir, RESULT_INDEX_FILENAME);
const writer = await createOutputWriter(indexPath, { append: this.appendMode });
this.writers.set(key, { dir, indexPath, writer });
return writer;
@@ -1682,7 +1683,7 @@ export async function runEvalCommand(
}
if (options.outputFormat) {
throw new Error(
'--output-format was removed from agentv eval. The run directory always writes index.jsonl.',
`--output-format was removed from agentv eval. The run directory always writes ${RESULT_INDEX_FILENAME}.`,
);
}
if (options.artifacts) {
@@ -1754,8 +1755,8 @@ export async function runEvalCommand(
`${modeLabel}: found ${existingResults.length} existing result(s), skipping ${resumeSkipKeys.size} completed.`,
);
} else {
// No existing bundle index.jsonl — behave like a normal run
console.log('Resume: no existing bundle index.jsonl found, starting fresh run.');
// No existing bundle manifest — behave like a normal run.
console.log('Resume: no existing bundle run manifest found, starting fresh run.');
}
} else {
console.warn(
@@ -2430,7 +2431,7 @@ export async function runEvalCommand(
}
if (isResumeAppend) {
// Resume mode: write per-test artifacts for newly-run tests, then
// aggregate each bundle from its full index.jsonl (old + new results
// aggregate each bundle from its full row manifest (old + new results
// with deduplication).
const { writePerTestArtifacts } = await import('./artifact-writer.js');
for (const bundleResults of resultsByBundle.values()) {
@@ -2450,9 +2451,9 @@ export async function runEvalCommand(
experimentMetadata: runExperimentMetadata,
runtimeSource: runtimeSourceMetadata,
});
const indexPath = path.join(bundleDir, 'index.jsonl');
const indexPath = path.join(bundleDir, RESULT_INDEX_FILENAME);
console.log(`Artifact bundle updated: ${bundleDir}`);
console.log(` Index: ${indexPath}`);
console.log(` Run manifest: ${indexPath}`);
console.log(
` Per-test artifacts: ${bundleDir} (${bundleResults.length} new test directories)`,
);
@@ -2477,7 +2478,7 @@ export async function runEvalCommand(
},
);
console.log(`Artifact bundle written to: ${bundleDir}`);
console.log(` Index: ${indexPath}`);
console.log(` Run manifest: ${indexPath}`);
console.log(
` Per-test artifacts: ${testArtifactDir} (${bundleResults.length} test directories)`,
);
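Reviewer sketch: the resume-mode comment above says each bundle is aggregated from its full manifest "with deduplication", but the deduplication itself is outside the visible hunks. The snippet below is one plausible reading, not the PR's implementation. It assumes rows are keyed by the result source identity from CONCEPTS.md (`eval_path`, `test_id`, `target`) and that the most recently appended row wins; `dedupeRows` is a hypothetical name.

```typescript
// Identity fields taken from the "Result source identity" entry in CONCEPTS.md.
interface IdentifiedRow {
  eval_path: string;
  test_id: string;
  target: string;
}

// Keep one row per identity. Later rows overwrite earlier ones, so a re-run
// appended during resume replaces the stale result.
// Assumption: last-write-wins; the actual policy is not shown in this diff.
export function dedupeRows<T extends IdentifiedRow>(rows: readonly T[]): T[] {
  const latest = new Map<string, T>();
  for (const row of rows) {
    // JSON.stringify of the tuple avoids collisions from naive string joins.
    const key = JSON.stringify([row.eval_path, row.test_id, row.target]);
    latest.set(key, row);
  }
  return Array.from(latest.values());
}
```

Because `Map.set` on an existing key keeps the original insertion position, the output preserves first-seen order while carrying the latest content for each identity.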
4 changes: 2 additions & 2 deletions apps/cli/src/commands/grade/index.ts
@@ -298,7 +298,7 @@ function printHumanOutput(result: GradePreparedResult): void {
console.log(`Trace: ${result.tracePath}`);
}
console.log(`Artifact workspace: ${result.outputDir}`);
console.log(`Index: ${result.indexPath}`);
console.log(`Run manifest: ${result.indexPath}`);
}

function isTraceEnvelopeDocument(value: unknown): boolean {
@@ -625,7 +625,7 @@ export const gradeCommand = command({
type: optional(string),
long: 'output',
short: 'o',
description: 'Run artifact directory (writes index.jsonl and per-test artifacts)',
description: 'Run artifact directory (writes run_manifest.jsonl and per-test artifacts)',
}),
response: option({
type: optional(string),