diff --git a/CONCEPTS.md b/CONCEPTS.md
index f578b1569..5443552da 100644
--- a/CONCEPTS.md
+++ b/CONCEPTS.md
@@ -24,11 +24,11 @@ Shared domain vocabulary for this project — entities, named processes, and sta
 
 **Workspace** — The task environment an eval prepares for the agent: repositories, templates, fixture files, and lifecycle hooks. It is not prompt input; use `input` for instructions and `workspace.repos[]` for multi-repo workspaces the agent can inspect or modify through tools.
 
-**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.
+**Run manifest** — The root `run_manifest.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.
 
 **Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.
 
-**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
+**Result directory** — The `result_dir` field in a `run_manifest.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `run_manifest.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
 
 **Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
 
diff --git a/README.md b/README.md
index 0d8688b7e..50da679c3 100644
--- a/README.md
+++ b/README.md
@@ -73,14 +73,14 @@ agentv eval evals/my-eval.yaml
 **5. Compare results across targets:**
 
 ```bash
-agentv compare .agentv/results/default//index.jsonl
+agentv compare .agentv/results/default//run_manifest.jsonl
 ```
 
 ## Output formats
 
 ```bash
-agentv eval evals/my-eval.yaml --output ./run # writes ./run/index.jsonl
-cat ./run/index.jsonl # JSONL results for scripts/CI
+agentv eval evals/my-eval.yaml --output ./run # writes ./run/run_manifest.jsonl
+cat ./run/run_manifest.jsonl # JSONL results for scripts/CI
 ```
 
 ## TypeScript SDK
diff --git a/ROADMAP.md b/ROADMAP.md
index e8d9f1b49..7096fd810 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -15,7 +15,7 @@ This roadmap translates [STRATEGY.md](STRATEGY.md) into the next few product pha
 
 ## Phase 1: Finish the artifact and local inspection foundation
 
-- Keep the canonical handoff surface centered on completed run bundles, `index.jsonl`, grading/timing/metrics artifacts, normalized transcripts, and optional `external_trace` link metadata.
+- Keep the canonical handoff surface centered on completed run bundles, `run_manifest.jsonl`, grading/timing/metrics artifacts, normalized transcripts, and optional `external_trace` link metadata.
 - Finish the vendor-neutral local export seams that let completed runs be re-read, compared, exported, and attached to non-Phoenix adapters without vendor-specific logic in core.
 - Keep OTLP/OpenInference mapping generic and reusable before building backend-specific upload or import paths.
 
diff --git a/STRATEGY.md b/STRATEGY.md
index 5f52388d9..41819a985 100644
--- a/STRATEGY.md
+++ b/STRATEGY.md
@@ -21,7 +21,7 @@ AgentV stays repo-native and workspace-native: it runs or imports evaluations ar
 
 - **Repo-native eval success** - Share of dogfood and example eval flows that run against real workspaces, hooks, repo materialization, or imported artifacts without extra infrastructure; measured by CI and manual UAT on canonical suites.
 - **Time to inspect a run** - Time from completed `agentv eval` to usable local review, compare, or report output from the canonical run bundle; measured through CLI and Dashboard/report workflows.
-- **Artifact portability coverage** - Share of integrations and follow-on workflows that consume `index.jsonl`, `summary.json`, trace sidecars, or imported run bundles instead of bespoke stores; measured by adapter smoke tests, docs, and example coverage.
+- **Artifact portability coverage** - Share of integrations and follow-on workflows that consume `run_manifest.jsonl`, `summary.json`, trace sidecars, or imported run bundles instead of bespoke stores; measured by adapter smoke tests, docs, and example coverage.
 - **Git-backed results reliability** - Success rate for publish, sync, resume, and WIP checkpoint flows across local branches and dedicated results repos; measured by integration tests and manual end-to-end verification.
 
 ## Tracks
diff --git a/apps/cli/src/cli.ts b/apps/cli/src/cli.ts
index 9c7946ed1..c2b50fa0e 100644
--- a/apps/cli/src/cli.ts
+++ b/apps/cli/src/cli.ts
@@ -6,7 +6,7 @@ import { runCli } from './index.js';
 // Forward SIGINT/SIGTERM to spawned provider subprocesses before exiting.
 // Without this, Dashboard's `child.kill('SIGTERM')` against the CLI orphans
 // any in-flight `claude`/`codex`/`pi`/`copilot` subprocess. The partial
-// `index.jsonl` is already row-by-row durable, so finished tests survive.
+// `run_manifest.jsonl` is already row-by-row durable, so finished tests survive.
// // First signal: kill children, exit with the conventional 128+signal code. // Second signal within the same process: hard-exit so a hung child cannot diff --git a/apps/cli/src/commands/compare/index.ts b/apps/cli/src/commands/compare/index.ts index bf4852606..20f1eba31 100644 --- a/apps/cli/src/commands/compare/index.ts +++ b/apps/cli/src/commands/compare/index.ts @@ -472,7 +472,7 @@ export const compareCommand = command({ type: string, displayName: 'results', description: - 'Run workspace or index.jsonl manifest path(s). One source: single-run mode. Two sources: pairwise mode.', + 'Run workspace or run manifest path(s). One source: single-run mode. Two sources: pairwise mode.', }), threshold: option({ type: optional(number), @@ -514,7 +514,7 @@ export const compareCommand = command({ try { if (results.length === 0) { - throw new Error('At least one run workspace or index.jsonl manifest is required'); + throw new Error('At least one run workspace or run manifest is required'); } if (results.length === 2) { @@ -602,7 +602,7 @@ export const compareCommand = command({ process.exit(exitCode); } } else { - throw new Error('Expected 1 or 2 run workspaces or index.jsonl manifests'); + throw new Error('Expected 1 or 2 run workspaces or run manifests'); } } catch (error) { console.error(`Error: ${(error as Error).message}`); diff --git a/apps/cli/src/commands/eval/commands/aggregate.ts b/apps/cli/src/commands/eval/commands/aggregate.ts index 275e792ab..67df96b72 100644 --- a/apps/cli/src/commands/eval/commands/aggregate.ts +++ b/apps/cli/src/commands/eval/commands/aggregate.ts @@ -11,7 +11,7 @@ export const evalAggregateCommand = command({ runDir: positional({ type: string, displayName: 'run-dir', - description: 'Path to a run directory containing index.jsonl', + description: 'Path to a run directory containing a run manifest', }), }, handler: async (args) => { diff --git a/apps/cli/src/commands/eval/commands/run.ts b/apps/cli/src/commands/eval/commands/run.ts index 
b577f6e59..97458dbf3 100644 --- a/apps/cli/src/commands/eval/commands/run.ts +++ b/apps/cli/src/commands/eval/commands/run.ts @@ -52,12 +52,12 @@ export const evalRunCommand = command({ long: 'output', short: 'o', description: - 'Run artifact directory (writes index.jsonl, summary.json, and per-case artifacts)', + 'Run artifact directory (writes run_manifest.jsonl, summary.json, and per-case artifacts)', }), outputFormat: option({ type: optional(string), long: 'output-format', - description: '[Removed] Run directories always write index.jsonl', + description: '[Removed] Run directories always write run_manifest.jsonl', }), experiment: option({ type: optional(string), @@ -161,7 +161,7 @@ export const evalRunCommand = command({ type: optional(string), long: 'retry-errors', description: - 'Path to a previous run workspace or index.jsonl manifest — re-run only execution_error test cases', + 'Path to a previous run workspace or run manifest — re-run only execution_error test cases', }), resume: flag({ long: 'resume', diff --git a/apps/cli/src/commands/eval/interactive.ts b/apps/cli/src/commands/eval/interactive.ts index 268fb231b..601630114 100644 --- a/apps/cli/src/commands/eval/interactive.ts +++ b/apps/cli/src/commands/eval/interactive.ts @@ -11,6 +11,7 @@ import { getCategories, } from './discover.js'; import { type LastConfig, loadLastConfig, saveLastConfig } from './last-config.js'; +import { resolveExistingRunPrimaryPath } from './result-layout.js'; import { runEvalCommand } from './run-eval.js'; import { findRepoRoot } from './shared.js'; @@ -89,10 +90,10 @@ async function promptMainMenu( type MenuChoice = 'new' | 'rerun' | 'resume' | 'exit'; const choices: Array<{ name: string; value: MenuChoice; description?: string }> = []; - // Resume entry: only when the prior run has a known artifact dir with an index.jsonl + // Resume entry: only when the prior run has a known artifact dir with a manifest. 
if (lastConfig?.outputDir) { - const indexPath = path.join(lastConfig.outputDir, 'index.jsonl'); - if (existsSync(indexPath)) { + const indexPath = resolveExistingRunPrimaryPath(lastConfig.outputDir); + if (indexPath && existsSync(indexPath)) { const dirLabel = path.basename(lastConfig.outputDir); choices.push({ name: '⏯ Resume last run', @@ -349,7 +350,7 @@ async function executeConfig( // Persist config with the resolved artifact dir so the wizard can offer // "Resume last run" on the next invocation. Done after a successful run so - // the saved outputDir always points at a real index.jsonl. + // the saved outputDir always points at a real run manifest. if (result) { await saveLastConfig({ timestamp: new Date().toISOString(), diff --git a/apps/cli/src/commands/eval/result-layout.ts b/apps/cli/src/commands/eval/result-layout.ts index 47e8d1fb6..ce53cde6c 100644 --- a/apps/cli/src/commands/eval/result-layout.ts +++ b/apps/cli/src/commands/eval/result-layout.ts @@ -1,7 +1,16 @@ -import { type Dirent, existsSync, readdirSync, statSync } from 'node:fs'; +import { type Dirent, existsSync, readFileSync, readdirSync, statSync } from 'node:fs'; import path from 'node:path'; -export const RESULT_INDEX_FILENAME = 'index.jsonl'; +export const RESULT_MANIFEST_FILENAME = 'run_manifest.jsonl'; +export const LEGACY_RESULT_INDEX_FILENAME = 'index.jsonl'; +// Backward-compatible export name retained for existing callers. New writes use +// the row-level run manifest filename. 
+export const RESULT_INDEX_FILENAME = RESULT_MANIFEST_FILENAME; +export const RESULT_MANIFEST_FILENAMES = [ + RESULT_MANIFEST_FILENAME, + LEGACY_RESULT_INDEX_FILENAME, +] as const; +export const RUN_SUMMARY_FILENAME = 'summary.json'; export const RESULTS_DIRNAME = 'results'; export const DEFAULT_EXPERIMENT_NAME = 'default'; export const RESERVED_RESULTS_NAMESPACES = new Set(['export', 'metadata', 'runs']); @@ -64,13 +73,48 @@ export function resolveRunIndexPath(runDir: string): string { } export function isRunManifestPath(filePath: string): boolean { - return path.basename(filePath) === RESULT_INDEX_FILENAME; + return RESULT_MANIFEST_FILENAMES.includes( + path.basename(filePath) as (typeof RESULT_MANIFEST_FILENAMES)[number], + ); +} + +function safeSummaryManifestPath(runDir: string, manifestPath: unknown): string | undefined { + if (typeof manifestPath !== 'string' || manifestPath.trim().length === 0) { + return undefined; + } + if (path.isAbsolute(manifestPath)) { + return undefined; + } + const normalized = path.normalize(manifestPath); + if (normalized.startsWith('..') || path.isAbsolute(normalized)) { + return undefined; + } + return path.join(runDir, normalized); +} + +function resolveSummaryManifestPath(runDir: string): string | undefined { + try { + const summary = JSON.parse(readFileSync(path.join(runDir, RUN_SUMMARY_FILENAME), 'utf8')) as { + manifest_path?: unknown; + }; + const manifestPath = safeSummaryManifestPath(runDir, summary.manifest_path); + return manifestPath && existsSync(manifestPath) ? 
manifestPath : undefined; + } catch { + return undefined; + } } export function resolveExistingRunPrimaryPath(runDir: string): string | undefined { - const indexPath = resolveRunIndexPath(runDir); - if (existsSync(indexPath)) { - return indexPath; + const summaryManifestPath = resolveSummaryManifestPath(runDir); + if (summaryManifestPath) { + return summaryManifestPath; + } + + for (const filename of RESULT_MANIFEST_FILENAMES) { + const manifestPath = path.join(runDir, filename); + if (existsSync(manifestPath)) { + return manifestPath; + } } return undefined; @@ -131,10 +175,12 @@ export function resolveWorkspaceOrFilePath(filePath: string): string { } if (nested.length > 1) { throw new Error( - `Result workspace contains multiple ${RESULT_INDEX_FILENAME} manifests; pass one bundle directory or manifest: ${filePath}`, + `Result workspace contains multiple run manifests; pass one bundle directory or manifest: ${filePath}`, ); } - throw new Error(`Result workspace is missing ${RESULT_INDEX_FILENAME}: ${filePath}`); + throw new Error( + `Result workspace is missing ${RESULT_MANIFEST_FILENAME} or legacy ${LEGACY_RESULT_INDEX_FILENAME}: ${filePath}`, + ); } export function resolveRunManifestPath(filePath: string): string { @@ -144,7 +190,7 @@ export function resolveRunManifestPath(filePath: string): string { if (!isRunManifestPath(filePath)) { throw new Error( - `Expected a run workspace directory or ${RESULT_INDEX_FILENAME} manifest: ${filePath}`, + `Expected a run workspace directory or ${RESULT_MANIFEST_FILENAME} manifest (legacy ${LEGACY_RESULT_INDEX_FILENAME} is also readable): ${filePath}`, ); } diff --git a/apps/cli/src/commands/eval/run-cache.ts b/apps/cli/src/commands/eval/run-cache.ts index 14969e6be..95f8b3b46 100644 --- a/apps/cli/src/commands/eval/run-cache.ts +++ b/apps/cli/src/commands/eval/run-cache.ts @@ -5,6 +5,7 @@ import path from 'node:path'; import { RESULT_INDEX_FILENAME, discoverRunManifestPaths, + isRunManifestPath, 
resolveExistingRunPrimaryPath, resolveRunIndexPath, } from './result-layout.js'; @@ -67,8 +68,7 @@ export async function resolveCachedRunDir(cwd: string): Promise { const dir = path.join(cwd, '.agentv'); - const lastRunDir = - path.basename(resultPath) === RESULT_INDEX_FILENAME ? path.dirname(resultPath) : resultPath; + const lastRunDir = isRunManifestPath(resultPath) ? path.dirname(resultPath) : resultPath; await mkdir(dir, { recursive: true }); const cache: RunCache = { lastRunDir, diff --git a/apps/cli/src/commands/eval/run-eval.ts b/apps/cli/src/commands/eval/run-eval.ts index 7490a6cd4..d801aba20 100644 --- a/apps/cli/src/commands/eval/run-eval.ts +++ b/apps/cli/src/commands/eval/run-eval.ts @@ -64,6 +64,7 @@ import { resolveOtelBackend } from './otel-backends.js'; import { type OutputWriter, createOutputWriter } from './output-writer.js'; import { ProgressDisplay, type Verdict, type WorkerProgress } from './progress-display.js'; import { + RESULT_INDEX_FILENAME, buildDefaultRunDirFromName, createRunDirName, discoverRunManifestPaths, @@ -139,7 +140,7 @@ interface NormalizedOptions { readonly keepWorkspaces: boolean; /** Removed: use --output instead */ readonly artifacts?: string; - /** Removed: the run directory always uses index.jsonl */ + /** Removed: the run directory always uses run_manifest.jsonl */ readonly outputFormat?: string; readonly graderTarget?: string; readonly model?: string; @@ -288,7 +289,7 @@ function outputFileMigrationMessage(value: string): string { ext === '.xml' ? 'JUnit XML export from agentv eval has been removed.' 
: 'Flat result file export from agentv eval has been removed.'; - return `--output expects a run directory, not a file path: ${value}\n${removalHint} Set --output for the canonical run artifacts; AgentV always writes /index.jsonl.`; + return `--output expects a run directory, not a file path: ${value}\n${removalHint} Set --output for the canonical run artifacts; AgentV always writes /${RESULT_INDEX_FILENAME}.`; } function artifactsMigrationMessage(artifactsDir: string, outputDir?: string): string { @@ -1076,7 +1077,7 @@ class BundleOutputWriter implements OutputWriter { } const dir = resultBundleDir(this.invocationDir, result); mkdirSync(dir, { recursive: true }); - const indexPath = path.join(dir, 'index.jsonl'); + const indexPath = path.join(dir, RESULT_INDEX_FILENAME); const writer = await createOutputWriter(indexPath, { append: this.appendMode }); this.writers.set(key, { dir, indexPath, writer }); return writer; @@ -1682,7 +1683,7 @@ export async function runEvalCommand( } if (options.outputFormat) { throw new Error( - '--output-format was removed from agentv eval. The run directory always writes index.jsonl.', + `--output-format was removed from agentv eval. The run directory always writes ${RESULT_INDEX_FILENAME}.`, ); } if (options.artifacts) { @@ -1754,8 +1755,8 @@ export async function runEvalCommand( `${modeLabel}: found ${existingResults.length} existing result(s), skipping ${resumeSkipKeys.size} completed.`, ); } else { - // No existing bundle index.jsonl — behave like a normal run - console.log('Resume: no existing bundle index.jsonl found, starting fresh run.'); + // No existing bundle manifest — behave like a normal run. 
+ console.log('Resume: no existing bundle run manifest found, starting fresh run.'); } } else { console.warn( @@ -2430,7 +2431,7 @@ export async function runEvalCommand( } if (isResumeAppend) { // Resume mode: write per-test artifacts for newly-run tests, then - // aggregate each bundle from its full index.jsonl (old + new results + // aggregate each bundle from its full row manifest (old + new results // with deduplication). const { writePerTestArtifacts } = await import('./artifact-writer.js'); for (const bundleResults of resultsByBundle.values()) { @@ -2450,9 +2451,9 @@ export async function runEvalCommand( experimentMetadata: runExperimentMetadata, runtimeSource: runtimeSourceMetadata, }); - const indexPath = path.join(bundleDir, 'index.jsonl'); + const indexPath = path.join(bundleDir, RESULT_INDEX_FILENAME); console.log(`Artifact bundle updated: ${bundleDir}`); - console.log(` Index: ${indexPath}`); + console.log(` Run manifest: ${indexPath}`); console.log( ` Per-test artifacts: ${bundleDir} (${bundleResults.length} new test directories)`, ); @@ -2477,7 +2478,7 @@ export async function runEvalCommand( }, ); console.log(`Artifact bundle written to: ${bundleDir}`); - console.log(` Index: ${indexPath}`); + console.log(` Run manifest: ${indexPath}`); console.log( ` Per-test artifacts: ${testArtifactDir} (${bundleResults.length} test directories)`, ); diff --git a/apps/cli/src/commands/grade/index.ts b/apps/cli/src/commands/grade/index.ts index aeb9b17c4..6d192e838 100644 --- a/apps/cli/src/commands/grade/index.ts +++ b/apps/cli/src/commands/grade/index.ts @@ -298,7 +298,7 @@ function printHumanOutput(result: GradePreparedResult): void { console.log(`Trace: ${result.tracePath}`); } console.log(`Artifact workspace: ${result.outputDir}`); - console.log(`Index: ${result.indexPath}`); + console.log(`Run manifest: ${result.indexPath}`); } function isTraceEnvelopeDocument(value: unknown): boolean { @@ -625,7 +625,7 @@ export const gradeCommand = command({ type: 
optional(string), long: 'output', short: 'o', - description: 'Run artifact directory (writes index.jsonl and per-test artifacts)', + description: 'Run artifact directory (writes run_manifest.jsonl and per-test artifacts)', }), response: option({ type: optional(string), diff --git a/apps/cli/src/commands/inspect/filter.ts b/apps/cli/src/commands/inspect/filter.ts index 2b453db23..7e32aaedf 100644 --- a/apps/cli/src/commands/inspect/filter.ts +++ b/apps/cli/src/commands/inspect/filter.ts @@ -14,7 +14,10 @@ import { existsSync, readFileSync, readdirSync, statSync } from 'node:fs'; import path from 'node:path'; import { command, number, oneOf, option, optional, positional, string } from 'cmd-ts'; -import { isReservedResultsNamespace } from '../eval/result-layout.js'; +import { + isReservedResultsNamespace, + resolveExistingRunPrimaryPath, +} from '../eval/result-layout.js'; import { normalizeResultRow } from '../results/result-row-schema.js'; import { c, formatScore, padLeft, padRight } from './utils.js'; @@ -34,9 +37,14 @@ export interface FilterableRecord { } /** - * Recursively collect all index.jsonl files under the runs directory. + * Recursively collect one run manifest per bundle under the runs directory. 
*/ function collectIndexFiles(dir: string): string[] { + const primaryPath = resolveExistingRunPrimaryPath(dir); + if (primaryPath) { + return [primaryPath]; + } + const files: string[] = []; try { const entries = readdirSync(dir, { withFileTypes: true }); @@ -44,8 +52,6 @@ function collectIndexFiles(dir: string): string[] { const fullPath = path.join(dir, entry.name); if (entry.isDirectory()) { files.push(...collectIndexFiles(fullPath)); - } else if (entry.name === 'index.jsonl') { - files.push(fullPath); } } } catch { @@ -64,8 +70,6 @@ function collectCurrentResultIndexFiles(cwd: string): string[] { const fullPath = path.join(resultsDir, entry.name); if (entry.isDirectory()) { files.push(...collectIndexFiles(fullPath)); - } else if (entry.name === 'index.jsonl') { - files.push(fullPath); } } } catch { diff --git a/apps/cli/src/commands/inspect/score.ts b/apps/cli/src/commands/inspect/score.ts index 75244e827..9ed390668 100644 --- a/apps/cli/src/commands/inspect/score.ts +++ b/apps/cli/src/commands/inspect/score.ts @@ -379,7 +379,7 @@ export const traceScoreCommand = command({ ); if (!hasTrace) { console.error( - `${c.red}Error:${c.reset} Source lacks trace metrics. Use an OTLP trace export via ${c.bold}--otel-file${c.reset} or a run manifest with summary metrics in ${c.bold}index.jsonl${c.reset}.`, + `${c.red}Error:${c.reset} Source lacks trace metrics. Use an OTLP trace export via ${c.bold}--otel-file${c.reset} or a run manifest with summary metrics in ${c.bold}run_manifest.jsonl${c.reset}.`, ); process.exit(1); } diff --git a/apps/cli/src/commands/inspect/search.ts b/apps/cli/src/commands/inspect/search.ts index a586d8a4e..76b9c9286 100644 --- a/apps/cli/src/commands/inspect/search.ts +++ b/apps/cli/src/commands/inspect/search.ts @@ -6,7 +6,7 @@ * content with surrounding context. 
* * Supported sources: - * - Run result manifests (index.jsonl) — searches serialized JSON content + * - Run result manifests — searches serialized JSON content * - Transcript JSONL files — searches message content and tool call data * * To extend: add new scanners in the `scanSources()` function for additional diff --git a/apps/cli/src/commands/inspect/utils.ts b/apps/cli/src/commands/inspect/utils.ts index fa6101b5f..3c95a4879 100644 --- a/apps/cli/src/commands/inspect/utils.ts +++ b/apps/cli/src/commands/inspect/utils.ts @@ -3,9 +3,9 @@ import path from 'node:path'; import type { EvaluationResult, TraceSummary } from '@agentv/core'; import { DEFAULT_THRESHOLD, toCamelCaseDeep, toSnakeCaseDeep } from '@agentv/core'; import { - RESULT_INDEX_FILENAME, buildResultsRootDir, isReservedResultsNamespace, + isRunManifestPath, resolveExistingRunPrimaryPath, resolveWorkspaceOrFilePath, } from '../eval/result-layout.js'; @@ -105,7 +105,7 @@ export interface RawTraceSpan { * Load all result or trace records from a supported source. 
* * Supported sources: - * - Run workspace directories / index.jsonl manifests + * - Run workspace directories / run manifests * - Standalone trace JSONL files for trace-only workflows * - OTLP JSON trace files written via --otel-file */ @@ -116,7 +116,7 @@ export function loadResultFile(filePath: string): RawResult[] { return loadOtlpTraceFile(resolvedFilePath); } - if (path.basename(resolvedFilePath) === RESULT_INDEX_FILENAME) { + if (isRunManifestPath(resolvedFilePath)) { return loadManifestAsRawResults(resolvedFilePath); } diff --git a/apps/cli/src/commands/pipeline/bench.ts b/apps/cli/src/commands/pipeline/bench.ts index fed6dce88..415339da6 100644 --- a/apps/cli/src/commands/pipeline/bench.ts +++ b/apps/cli/src/commands/pipeline/bench.ts @@ -6,7 +6,7 @@ * * Writes: * - /grading.json (per-test grading breakdown) - * - index.jsonl (one line per test) + * - run_manifest.jsonl (one line per test) * - summary.json (aggregate statistics) */ import { existsSync } from 'node:fs'; @@ -15,7 +15,7 @@ import { join } from 'node:path'; import { command, positional, string } from 'cmd-ts'; -import { DEFAULT_THRESHOLD, type EvaluationResult } from '@agentv/core'; +import { DEFAULT_THRESHOLD, type EvaluationResult, RESULT_INDEX_FILENAME } from '@agentv/core'; import { maybeAutoExportRunArtifacts } from '../results/remote.js'; interface EvaluatorScore { @@ -192,9 +192,9 @@ export const evalBenchCommand = command({ ); } - // Write index.jsonl + // Write row-level run manifest. await writeFile( - join(exportDir, 'index.jsonl'), + join(exportDir, RESULT_INDEX_FILENAME), indexLines.length > 0 ? 
`${indexLines.join('\n')}\n` : '', 'utf8', ); @@ -202,6 +202,7 @@ export const evalBenchCommand = command({ // Write summary.json const passRateStats = computeStats(allPassRates); const summary = { + manifest_path: RESULT_INDEX_FILENAME, metadata: { eval_file: manifest.eval_file, timestamp: manifest.timestamp, diff --git a/apps/cli/src/commands/results/combine-run.ts b/apps/cli/src/commands/results/combine-run.ts index f8af0f8b4..56b0382b6 100644 --- a/apps/cli/src/commands/results/combine-run.ts +++ b/apps/cli/src/commands/results/combine-run.ts @@ -34,6 +34,7 @@ import { buildTestTargetKey, } from '../eval/artifact-writer.js'; import { + RESULT_INDEX_FILENAME, buildDefaultRunDirFromName, createRunDirName, normalizeExperimentName, @@ -611,7 +612,7 @@ export function combineRunSources(options: CombineRunOptions): CombineRunResult mkdirSync(runDir, { recursive: true }); const records = rows.map((row) => rewriteAndCopyRecord(row, runDir, experiment)); - const manifestPath = path.join(runDir, 'index.jsonl'); + const manifestPath = path.join(runDir, RESULT_INDEX_FILENAME); writeJsonl(manifestPath, records); const summary = buildRunSummaryArtifact(results, '', 'combined', results.length); diff --git a/apps/cli/src/commands/results/combine.ts b/apps/cli/src/commands/results/combine.ts index 4566df69f..913c4fc71 100644 --- a/apps/cli/src/commands/results/combine.ts +++ b/apps/cli/src/commands/results/combine.ts @@ -96,7 +96,7 @@ export const resultsCombineCommand = command({ sources: restPositionals({ type: string, displayName: 'source', - description: 'Run workspace directory or index.jsonl manifest', + description: 'Run workspace directory or run manifest', }), output: option({ type: optional(string), @@ -125,7 +125,7 @@ export const resultsCombineCommand = command({ }, handler: async (args) => { if (args.sources.length < 2) { - console.error('Error: provide at least two run workspaces or index.jsonl manifests'); + console.error('Error: provide at least two run 
workspaces or run manifests'); process.exit(1); } diff --git a/apps/cli/src/commands/results/delete-run.ts b/apps/cli/src/commands/results/delete-run.ts index f8f3bc90f..125f5bfc2 100644 --- a/apps/cli/src/commands/results/delete-run.ts +++ b/apps/cli/src/commands/results/delete-run.ts @@ -3,7 +3,7 @@ * Dashboard API. * * Deletes exactly one local run workspace directory under `.agentv/results/`. - * Callers may pass a run ID, run workspace directory, or `index.jsonl` path. + * Callers may pass a run ID, run workspace directory, or run manifest path. * Remote runs and paths outside the local results tree are rejected before * anything is removed. */ @@ -12,7 +12,9 @@ import { existsSync, rmSync } from 'node:fs'; import path from 'node:path'; import { + LEGACY_RESULT_INDEX_FILENAME, RESULT_INDEX_FILENAME, + isRunManifestPath, relativeRunPathFromCwd, resolveRunManifestPath, } from '../eval/result-layout.js'; @@ -32,8 +34,10 @@ export interface DeleteRunResult extends DeleteRunTarget { function assertLocalRunManifest(cwd: string, manifestPath: string, runId: string): DeleteRunTarget { const resolvedManifestPath = path.resolve(manifestPath); - if (path.basename(resolvedManifestPath) !== RESULT_INDEX_FILENAME) { - throw new Error('Expected a run workspace directory or index.jsonl manifest'); + if (!isRunManifestPath(resolvedManifestPath)) { + throw new Error( + `Expected a run workspace directory or ${RESULT_INDEX_FILENAME} manifest (legacy ${LEGACY_RESULT_INDEX_FILENAME} is also readable)`, + ); } const runDir = path.dirname(resolvedManifestPath); diff --git a/apps/cli/src/commands/results/delete.ts b/apps/cli/src/commands/results/delete.ts index a16b5b5c0..5b50b05ab 100644 --- a/apps/cli/src/commands/results/delete.ts +++ b/apps/cli/src/commands/results/delete.ts @@ -2,7 +2,7 @@ * `agentv results delete` — remove one or more local run workspaces. * * The command requires confirmation unless `--yes` is passed. 
It accepts local - * run IDs, run workspace directories, or `index.jsonl` manifests and refuses + * run IDs, run workspace directories, or run manifests and refuses * remote runs. */ @@ -28,7 +28,7 @@ export const resultsDeleteCommand = command({ runs: restPositionals({ type: string, displayName: 'run', - description: 'Local run ID, run workspace directory, or index.jsonl manifest', + description: 'Local run ID, run workspace directory, or run manifest', }), yes: flag({ long: 'yes', diff --git a/apps/cli/src/commands/results/eval-runner.ts b/apps/cli/src/commands/results/eval-runner.ts index 4be84255e..a3d54b6af 100644 --- a/apps/cli/src/commands/results/eval-runner.ts +++ b/apps/cli/src/commands/results/eval-runner.ts @@ -29,7 +29,11 @@ import type { Hono } from 'hono'; import { TARGET_FILE_CANDIDATES } from '../../utils/targets.js'; import { discoverEvalFiles } from '../eval/discover.js'; -import { buildDefaultRunDir, normalizeExperimentName } from '../eval/result-layout.js'; +import { + RESULT_INDEX_FILENAME, + buildDefaultRunDir, + normalizeExperimentName, +} from '../eval/result-layout.js'; import { findRepoRoot } from '../eval/shared.js'; import { normalizeTags, writeRunTags } from './run-tags.js'; @@ -74,12 +78,12 @@ function pruneFinishedRuns() { } /** - * Look up the target for a Dashboard-launched run by its index.jsonl path. + * Look up the target for a Dashboard-launched run by its run manifest path. * Called by handleRuns in serve.ts when the JSONL has 0 records (run just started). 
*/ export function getActiveRunTarget(indexJsonlPath: string): string | undefined { for (const run of activeRuns.values()) { - if (run.outputDir && path.join(run.outputDir, 'index.jsonl') === indexJsonlPath) { + if (run.outputDir && path.join(run.outputDir, RESULT_INDEX_FILENAME) === indexJsonlPath) { return run.target; } } @@ -87,14 +91,14 @@ export function getActiveRunTarget(indexJsonlPath: string): string | undefined { } /** - * Look up the in-memory status for a Dashboard-launched run by its index.jsonl path. + * Look up the in-memory status for a Dashboard-launched run by its manifest path. * Returns 'starting' | 'running' | 'finished' | 'failed' if the run is tracked, * else undefined. Used by handleRuns to render a spinner for active runs in the * RunList instead of a misleading red ✗ derived from a 0 pass-rate. */ export function getActiveRunStatus(indexJsonlPath: string): DashboardRun['status'] | undefined { for (const run of activeRuns.values()) { - if (run.outputDir && path.join(run.outputDir, 'index.jsonl') === indexJsonlPath) { + if (run.outputDir && path.join(run.outputDir, RESULT_INDEX_FILENAME) === indexJsonlPath) { return run.status; } } @@ -149,7 +153,7 @@ interface RunEvalRequest { resume?: boolean; /** Re-run failed/errored tests while keeping passing results. */ rerun_failed?: boolean; - /** Path to a previous run dir or index.jsonl — re-run only execution_error cases. */ + /** Path to a previous run dir or run manifest — re-run only execution_error cases. */ retry_errors?: string; /** Artifact directory for run output. Required when resume/rerun_failed are set without auto-detect. 
*/ output?: string; @@ -330,7 +334,7 @@ function openConsoleLogStream(outputDir: string): WriteStream | undefined { function writeInitialRunTags(outputDir: string, tags: readonly string[]): void { if (tags.length === 0) return; mkdirSync(outputDir, { recursive: true }); - writeRunTags(path.join(outputDir, 'index.jsonl'), tags); + writeRunTags(path.join(outputDir, RESULT_INDEX_FILENAME), tags); } // ── Route registration ─────────────────────────────────────────────────── @@ -504,7 +508,7 @@ export function registerEvalRoutes( // ── Stop a running eval ──────────────────────────────────────────────── // POST (not DELETE) because Stop is part of the stop → resume → complete // workflow, not a destructive cancel. The run remains resumable from the - // partial index.jsonl on disk. Idempotent: hitting /stop on a terminal + // partial run manifest on disk. Idempotent: hitting /stop on a terminal // run returns 200 with `stopped: false, reason: 'already_terminal'` // rather than 4xx, so clients can fire-and-forget. // diff --git a/apps/cli/src/commands/results/export.ts b/apps/cli/src/commands/results/export.ts index 83dbec6f0..f6d3b0dd1 100644 --- a/apps/cli/src/commands/results/export.ts +++ b/apps/cli/src/commands/results/export.ts @@ -1,11 +1,11 @@ /** - * `agentv results export` — converts a canonical run workspace or index.jsonl + * `agentv results export` — converts a canonical run workspace or run * manifest into a directory structure matching the artifact-writer output format. 
* * Output structure: * / * summary.json — run aggregate scores, metadata, and timing - * index.jsonl — per-test manifest with artifact pointers + * run_manifest.jsonl — per-test manifest with artifact pointers * / * summary.json — per-case aggregate * run-1/result.json — per-run result @@ -28,7 +28,12 @@ import { command, flag, oneOf, option, optional, positional, string } from 'cmd- import type { EvaluationResult, ExportDuplicatePolicy, IndexArtifactEntry } from '@agentv/core'; import { parseJsonlResults, writeArtifactsFromResults } from '../eval/artifact-writer.js'; -import { RESULT_INDEX_FILENAME, isReservedResultsNamespace } from '../eval/result-layout.js'; +import { + LEGACY_RESULT_INDEX_FILENAME, + RESULT_INDEX_FILENAME, + isReservedResultsNamespace, + isRunManifestPath, +} from '../eval/result-layout.js'; import { loadManifestResults } from './manifest.js'; import { type ProjectionBundle, @@ -65,8 +70,10 @@ export async function exportResults( * Derive the default output directory from a run manifest path. 
*/ export function deriveOutputDir(cwd: string, sourceFile: string): string { - if (path.basename(sourceFile) !== RESULT_INDEX_FILENAME) { - throw new Error(`Expected a run manifest named ${RESULT_INDEX_FILENAME}: ${sourceFile}`); + if (!isRunManifestPath(sourceFile)) { + throw new Error( + `Expected a run manifest named ${RESULT_INDEX_FILENAME} (legacy ${LEGACY_RESULT_INDEX_FILENAME} is also readable): ${sourceFile}`, + ); } const runDir = path.dirname(sourceFile); @@ -87,7 +94,7 @@ export function deriveOutputDir(cwd: string, sourceFile: string): string { } export function deriveExportRunId(sourceFile: string): string { - if (path.basename(sourceFile) === RESULT_INDEX_FILENAME) { + if (isRunManifestPath(sourceFile)) { return path.basename(path.dirname(sourceFile)); } return path.basename(sourceFile, path.extname(sourceFile)); @@ -136,13 +143,13 @@ export function buildProjectionBundleFromExportedIndex(options: { export const resultsExportCommand = command({ name: 'export', - description: 'Export a run workspace or index.jsonl manifest into a per-test directory structure', + description: 'Export a run workspace or run manifest into a per-test directory structure', args: { source: positional({ type: optional(string), displayName: 'source', description: - 'Run workspace directory or index.jsonl manifest to export (defaults to most recent in .agentv/results/)', + 'Run workspace directory or run manifest to export (defaults to most recent in .agentv/results/)', }), out: option({ type: optional(string), diff --git a/apps/cli/src/commands/results/manifest.ts b/apps/cli/src/commands/results/manifest.ts index e2b017eb8..823be5a1f 100644 --- a/apps/cli/src/commands/results/manifest.ts +++ b/apps/cli/src/commands/results/manifest.ts @@ -16,8 +16,8 @@ import { import type { GradingArtifact, TimingArtifact } from '../eval/artifact-writer.js'; import { - RESULT_INDEX_FILENAME, isDirectoryPath, + isRunManifestPath, resolveRunManifestPath, } from '../eval/result-layout.js'; 
import { normalizeResultRow } from './result-row-schema.js'; @@ -289,7 +289,7 @@ export function parseResultManifest(content: string): ResultManifestRecord[] { export function resolveResultSourcePath(source: string, cwd?: string): string { const resolved = path.isAbsolute(source) ? source : path.resolve(cwd ?? process.cwd(), source); - if (isDirectoryPath(resolved) || path.basename(resolved) === RESULT_INDEX_FILENAME) { + if (isDirectoryPath(resolved) || isRunManifestPath(resolved)) { return resolveRunManifestPath(resolved); } return resolved; diff --git a/apps/cli/src/commands/results/remote.ts b/apps/cli/src/commands/results/remote.ts index 222552098..bea2d0db2 100644 --- a/apps/cli/src/commands/results/remote.ts +++ b/apps/cli/src/commands/results/remote.ts @@ -23,7 +23,7 @@ import { syncResultsRepoForProject, } from '@agentv/core'; -import { relativeRunPathFromCwd } from '../eval/result-layout.js'; +import { RESULT_INDEX_FILENAME, relativeRunPathFromCwd } from '../eval/result-layout.js'; import { findRepoRoot } from '../eval/shared.js'; import { type ResultFileMeta, @@ -149,7 +149,7 @@ function remoteMetadataManifestPath( if (!relativeRunPath) { return undefined; } - return path.join(config.path, 'runs', ...relativeRunPath.split('/'), 'index.jsonl'); + return path.join(config.path, 'runs', ...relativeRunPath.split('/'), RESULT_INDEX_FILENAME); } export interface ResultsPublishOverrides { diff --git a/apps/cli/src/commands/results/report.ts b/apps/cli/src/commands/results/report.ts index cf3dc95ef..123ac1418 100644 --- a/apps/cli/src/commands/results/report.ts +++ b/apps/cli/src/commands/results/report.ts @@ -296,7 +296,7 @@ export async function writeResultsReport( export const resultsReportCommand = command({ name: 'report', - description: 'Generate a static HTML report from a run workspace or index.jsonl manifest', + description: 'Generate a static HTML report from a run workspace or run manifest', args: { source: sourceArg, out: option({ diff --git 
a/apps/cli/src/commands/results/run-tags.ts b/apps/cli/src/commands/results/run-tags.ts index 992862293..449827d65 100644 --- a/apps/cli/src/commands/results/run-tags.ts +++ b/apps/cli/src/commands/results/run-tags.ts @@ -1,7 +1,7 @@ /** * Per-run tag sidecar file helpers. * - * Tags are stored as a `tags.json` sidecar next to the run's `index.jsonl` + * Tags are stored as a `tags.json` sidecar next to the run * manifest. The sidecar is optional, mutable, and non-breaking — absence * means the run has no user-assigned tags. * @@ -52,7 +52,7 @@ export interface RunTagsFile { tag_revision: string; } -/** Resolve the tags sidecar path given a run manifest (index.jsonl) path. */ +/** Resolve the tags sidecar path given a run manifest path. */ export function runTagsPath(manifestPath: string): string { return path.join(path.dirname(manifestPath), RUN_TAGS_FILENAME); } diff --git a/apps/cli/src/commands/results/serve.ts b/apps/cli/src/commands/results/serve.ts index 9f609e831..d35facc6f 100644 --- a/apps/cli/src/commands/results/serve.ts +++ b/apps/cli/src/commands/results/serve.ts @@ -77,7 +77,11 @@ import { Hono } from 'hono'; import { enforceRequiredVersion } from '../../version-check.js'; import { parseJsonlResults } from '../eval/artifact-writer.js'; -import { relativeRunPathFromCwd } from '../eval/result-layout.js'; +import { + RESULT_INDEX_FILENAME, + isRunManifestPath, + relativeRunPathFromCwd, +} from '../eval/result-layout.js'; import { loadRunCache, resolveRunCacheFile } from '../eval/run-cache.js'; import { findRepoRoot } from '../eval/shared.js'; import { listResultFiles } from '../inspect/utils.js'; @@ -131,7 +135,7 @@ const DIRECT_DASHBOARD_SOURCE_GUIDANCE = [ 'Run it from a project root, or pass --dir so Dashboard uses /.agentv/results/:', ' agentv dashboard --dir ', 'To browse external results, configure results.repo.remote or results.repo.path in config YAML.', - 'For a one-off run bundle, use: agentv results report ', + `For a one-off run 
bundle, use: agentv results report `, ].join('\n'); function unsupportedDashboardSourceError(source: string, cwd: string): Error { @@ -679,7 +683,8 @@ function manifestRecordSelection( function relativeRunPathFromNormalizedManifestPath(manifestPath: string): string | undefined { const parts = manifestPath.split('/').filter(Boolean); const runsIndex = parts.lastIndexOf('runs'); - if (runsIndex === -1 || parts.at(-1) !== 'index.jsonl') { + const manifestName = parts.at(-1); + if (runsIndex === -1 || !manifestName || !isRunManifestPath(manifestName)) { return undefined; } const runParts = parts.slice(runsIndex + 1, -1); @@ -3031,7 +3036,7 @@ function validateLocalCompletedRun( } const manifestPath = path.resolve(meta.path); - if (path.basename(manifestPath) !== 'index.jsonl') { + if (!isRunManifestPath(manifestPath)) { return { error: 'Run workspace is invalid', status: 400 }; } diff --git a/apps/cli/src/commands/results/shared.ts b/apps/cli/src/commands/results/shared.ts index c73513246..9029bfc53 100644 --- a/apps/cli/src/commands/results/shared.ts +++ b/apps/cli/src/commands/results/shared.ts @@ -23,7 +23,7 @@ export const sourceArg = positional({ type: optional(string), displayName: 'source', description: - 'Run workspace directory or index.jsonl manifest (defaults to most recent in .agentv/results/)', + 'Run workspace directory or run manifest (defaults to most recent in .agentv/results/)', }); /** diff --git a/apps/cli/src/commands/results/validate.ts b/apps/cli/src/commands/results/validate.ts index b75288498..f6047c459 100644 --- a/apps/cli/src/commands/results/validate.ts +++ b/apps/cli/src/commands/results/validate.ts @@ -4,12 +4,12 @@ * * Checks: * 1. Directory follows the `.agentv/results//` naming convention - * 2. index.jsonl exists and each line has required fields + * 2. run_manifest.jsonl exists and each line has required fields * 3. Per-case summary.json exists for every entry in the index * 4. 
Per-run result.json and grading.json exist for every materialized trial * 5. summary.json exists * 6. Scores are within [0, 1] - * 7. index.jsonl entries have `scores[]` array (warning if missing — dashboard needs it) + * 7. Run manifest entries have `scores[]` array (warning if missing — dashboard needs it) * * Exit code 0 = valid, 1 = errors found. * @@ -20,6 +20,12 @@ import path from 'node:path'; import { command, positional, string } from 'cmd-ts'; +import { + LEGACY_RESULT_INDEX_FILENAME, + RESULT_INDEX_FILENAME, + resolveExistingRunPrimaryPath, +} from '../eval/result-layout.js'; + // ── Types ──────────────────────────────────────────────────────────────── interface Diagnostic { @@ -94,12 +100,16 @@ export function validateRunDirectory(runDir: string): { } function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: IndexEntry[] } { - const indexPath = path.join(runDir, 'index.jsonl'); + const indexPath = resolveExistingRunPrimaryPath(runDir); const diagnostics: Diagnostic[] = []; const entries: IndexEntry[] = []; + const manifestLabel = indexPath ? 
path.basename(indexPath) : RESULT_INDEX_FILENAME; - if (!existsSync(indexPath)) { - diagnostics.push({ severity: 'error', message: 'index.jsonl is missing' }); + if (!indexPath || !existsSync(indexPath)) { + diagnostics.push({ + severity: 'error', + message: `${RESULT_INDEX_FILENAME} is missing (legacy ${LEGACY_RESULT_INDEX_FILENAME} is also readable)`, + }); return { diagnostics, entries }; } @@ -107,7 +117,7 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: const lines = content.split('\n').filter((l) => l.trim().length > 0); if (lines.length === 0) { - diagnostics.push({ severity: 'error', message: 'index.jsonl is empty' }); + diagnostics.push({ severity: 'error', message: `${manifestLabel} is empty` }); return { diagnostics, entries }; } @@ -119,40 +129,40 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: if (!entry.test_id) { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1}: missing 'test_id'`, + message: `${manifestLabel} line ${i + 1}: missing 'test_id'`, }); } if (entry.score === undefined || entry.score === null) { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'score'`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): missing 'score'`, }); } else if (typeof entry.score !== 'number' || entry.score < 0 || entry.score > 1) { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): score ${entry.score} is outside [0, 1]`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): score ${entry.score} is outside [0, 1]`, }); } if (!entry.target) { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'target'`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? 
'?'}): missing 'target'`, }); } if (!entry.summary_path) { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'summary_path'`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): missing 'summary_path'`, }); } if (typeof entry.trace_path === 'string') { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): trace_path is no longer supported; use transcript_path and metrics_path`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): trace_path is no longer supported; use transcript_path and metrics_path`, }); } @@ -165,14 +175,14 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: ) { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): artifact_pointers.trace is no longer supported`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): artifact_pointers.trace is no longer supported`, }); } if (!entry.scores || !Array.isArray(entry.scores) || entry.scores.length === 0) { diagnostics.push({ severity: 'warning', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'scores[]' array — dashboard may not show per-grader breakdown`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): missing 'scores[]' array — dashboard may not show per-grader breakdown`, }); } else { for (let j = 0; j < entry.scores.length; j++) { @@ -180,7 +190,7 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: if (!s || typeof s !== 'object') { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): scores[${j}] is not an object`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? 
'?'}): scores[${j}] is not an object`, }); continue; } @@ -192,7 +202,7 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: if (missing.length > 0) { diagnostics.push({ severity: 'warning', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): scores[${j}] missing fields: ${missing.join(', ')}`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): scores[${j}] missing fields: ${missing.join(', ')}`, }); } } @@ -201,18 +211,18 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: if (!entry.execution_status) { diagnostics.push({ severity: 'warning', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'execution_status'`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? '?'}): missing 'execution_status'`, }); } else if (!['ok', 'quality_failure', 'execution_error'].includes(entry.execution_status)) { diagnostics.push({ severity: 'warning', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): unknown execution_status '${entry.execution_status}' (expected: ok, quality_failure, execution_error)`, + message: `${manifestLabel} line ${i + 1} (${entry.test_id ?? 
'?'}): unknown execution_status '${entry.execution_status}' (expected: ok, quality_failure, execution_error)`, }); } } catch { diagnostics.push({ severity: 'error', - message: `index.jsonl line ${i + 1}: invalid JSON`, + message: `${manifestLabel} line ${i + 1}: invalid JSON`, }); } } diff --git a/apps/cli/src/commands/runs/rerun.ts b/apps/cli/src/commands/runs/rerun.ts index d05c520db..465efa3ec 100644 --- a/apps/cli/src/commands/runs/rerun.ts +++ b/apps/cli/src/commands/runs/rerun.ts @@ -401,7 +401,7 @@ export const runsRerunCommand = command({ runDir: positional({ type: string, displayName: 'run-dir', - description: 'Run workspace directory or index.jsonl manifest containing task bundles', + description: 'Run workspace directory or run manifest containing task bundles', }), testId: multioption({ type: array(string), diff --git a/apps/cli/src/commands/trend/index.ts b/apps/cli/src/commands/trend/index.ts index c9dadf7a2..01ce70792 100644 --- a/apps/cli/src/commands/trend/index.ts +++ b/apps/cli/src/commands/trend/index.ts @@ -3,7 +3,7 @@ import path from 'node:path'; import { command, flag, number, oneOf, option, optional, restPositionals, string } from 'cmd-ts'; import { toSnakeCaseDeep } from '../../utils/case-conversion.js'; -import { RESULT_INDEX_FILENAME } from '../eval/result-layout.js'; +import { RESULT_INDEX_FILENAME, isRunManifestPath } from '../eval/result-layout.js'; import { listResultFiles } from '../inspect/utils.js'; import { type LightweightResultRecord, @@ -118,7 +118,7 @@ function colorizeSlope(value: number): string { function ensureTrendIndexPath(source: string, cwd: string): string { const resolved = resolveResultSourcePath(source, cwd); - if (path.basename(resolved) !== RESULT_INDEX_FILENAME) { + if (!isRunManifestPath(resolved)) { throw new Error( `Unsupported result source for trend: ${source}. 
Use a run workspace directory or ${RESULT_INDEX_FILENAME} manifest.`, ); @@ -148,7 +148,7 @@ export function resolveTrendSources( } const metas = listResultFiles(cwd) - .filter((meta) => path.basename(meta.path) === RESULT_INDEX_FILENAME) + .filter((meta) => isRunManifestPath(meta.path)) .slice(0, last); if (metas.length < 2) { @@ -409,7 +409,7 @@ export const trendCommand = command({ runs: restPositionals({ type: string, displayName: 'runs', - description: 'Run workspace directories or index.jsonl manifest paths', + description: 'Run workspace directories or run manifest paths', }), last: option({ type: optional(number), diff --git a/apps/cli/test/commands/compare/compare.test.ts b/apps/cli/test/commands/compare/compare.test.ts index 8b082071b..b9b38b40f 100644 --- a/apps/cli/test/commands/compare/compare.test.ts +++ b/apps/cli/test/commands/compare/compare.test.ts @@ -27,10 +27,10 @@ describe('compare command', () => { }); describe('loadJsonlResults', () => { - it('should load index.jsonl manifests from a run workspace', () => { + it('should load run_manifest.jsonl manifests from a run workspace', () => { const runDir = path.join(tempDir, 'eval_2026-03-24T00-00-00-000Z'); mkdirSync(runDir, { recursive: true }); - const filePath = path.join(runDir, 'index.jsonl'); + const filePath = path.join(runDir, 'run_manifest.jsonl'); writeFileSync( filePath, '{"test_id": "case-1", "score": 0.8, "grading_path": "case-1/grading.json", "timing_path": "case-1/timing.json"}\n{"test_id": "case-2", "score": 0.9, "grading_path": "case-2/grading.json", "timing_path": "case-2/timing.json"}\n', @@ -44,10 +44,39 @@ describe('compare command', () => { ]); }); - it('should handle empty lines in index.jsonl manifests', () => { + it('should prefer summary.json manifest_path over a legacy index.jsonl in the same workspace', () => { + const runDir = path.join(tempDir, 'eval_2026-03-24T00-00-00-000Z'); + mkdirSync(runDir, { recursive: true }); + writeFileSync( + path.join(runDir, 
'run_manifest.jsonl'), + '{"test_id": "canonical", "score": 0.8}\n', + ); + writeFileSync(path.join(runDir, 'index.jsonl'), '{"test_id": "legacy", "score": 0.1}\n'); + writeFileSync( + path.join(runDir, 'summary.json'), + `${JSON.stringify({ manifest_path: 'run_manifest.jsonl' })}\n`, + ); + + const results = loadJsonlResults(runDir); + + expect(results).toEqual([{ testId: 'canonical', score: 0.8 }]); + }); + + it('should still accept legacy index.jsonl manifests directly', () => { const runDir = path.join(tempDir, 'eval_2026-03-24T00-00-00-000Z'); mkdirSync(runDir, { recursive: true }); const filePath = path.join(runDir, 'index.jsonl'); + writeFileSync(filePath, '{"test_id": "legacy-case", "score": 0.8}\n'); + + const results = loadJsonlResults(filePath); + + expect(results).toEqual([{ testId: 'legacy-case', score: 0.8 }]); + }); + + it('should handle empty lines in run_manifest.jsonl manifests', () => { + const runDir = path.join(tempDir, 'eval_2026-03-24T00-00-00-000Z'); + mkdirSync(runDir, { recursive: true }); + const filePath = path.join(runDir, 'run_manifest.jsonl'); writeFileSync( filePath, '{"test_id": "case-1", "score": 0.8, "grading_path": "case-1/grading.json", "timing_path": "case-1/timing.json"}\n\n{"test_id": "case-2", "score": 0.9, "grading_path": "case-2/grading.json", "timing_path": "case-2/timing.json"}\n', @@ -87,7 +116,7 @@ describe('compare command', () => { writeFileSync(filePath, '{"test_id": "case-1", "score": 0.8}\n'); expect(() => loadJsonlResults(filePath)).toThrow( - 'Expected a run workspace directory or index.jsonl manifest', + 'Expected a run workspace directory or run_manifest.jsonl manifest', ); }); }); @@ -190,10 +219,10 @@ describe('compare command', () => { expect(groups.get('a')).toHaveLength(2); }); - it('should group records from index.jsonl manifests', () => { + it('should group records from run_manifest.jsonl manifests', () => { const runDir = path.join(tempDir, 'eval_2026-03-24T00-00-00-000Z'); mkdirSync(runDir, { 
recursive: true }); - const filePath = path.join(runDir, 'index.jsonl'); + const filePath = path.join(runDir, 'run_manifest.jsonl'); writeFileSync( filePath, [ @@ -213,7 +242,7 @@ describe('compare command', () => { writeFileSync(filePath, '{"test_id": "t1", "score": 0.8, "target": "a"}\n'); expect(() => loadCombinedResults(filePath)).toThrow( - 'Expected a run workspace directory or index.jsonl manifest', + 'Expected a run workspace directory or run_manifest.jsonl manifest', ); }); }); diff --git a/apps/cli/test/commands/eval/aggregate.test.ts b/apps/cli/test/commands/eval/aggregate.test.ts index 734b400e5..7c06d44dd 100644 --- a/apps/cli/test/commands/eval/aggregate.test.ts +++ b/apps/cli/test/commands/eval/aggregate.test.ts @@ -15,6 +15,7 @@ import { type EvaluationResult, buildTraceFromMessages } from '@agentv/core'; import { toSnakeCaseDeep } from '../../../src/utils/case-conversion.js'; import { + RESULT_INDEX_FILENAME, aggregateRunDir, deduplicateByTestIdTarget, parseJsonlResults, @@ -46,21 +47,25 @@ function makeResult(overrides: Partial = {}): EvaluationResult }; } -function writeJsonlIndex(dir: string, results: Partial[]): string { - const indexPath = path.join(dir, 'index.jsonl'); +function writeJsonlIndex( + dir: string, + results: Partial[], + filename = RESULT_INDEX_FILENAME, +): string { + const indexPath = path.join(dir, filename); const lines = results.map((r) => JSON.stringify(toSnakeCaseDeep(makeResult(r)))).join('\n'); writeFileSync(indexPath, `${lines}\n`); return indexPath; } function readIndexRows(dir: string): Array<{ test_id: string; result_dir: string }> { - const indexPath = path.join(dir, 'index.jsonl'); + const indexPath = path.join(dir, RESULT_INDEX_FILENAME); if (!existsSync(indexPath)) { return readdirSync(dir) .filter((entry) => /--[a-f0-9]{12}$/.test(entry)) .map((entry) => ({ test_id: entry.replace(/--[a-f0-9]{12}$/, ''), result_dir: entry })); } - return readFileSync(path.join(dir, 'index.jsonl'), 'utf8') + return 
readFileSync(indexPath, 'utf8') .trim() .split('\n') .filter(Boolean) @@ -200,7 +205,7 @@ describe('aggregateRunDir', () => { rmSync(tmpDir, { recursive: true, force: true }); }); - it('reads index.jsonl, deduplicates, and writes summary.json with timing rollups', async () => { + it('reads run_manifest.jsonl, deduplicates, and writes summary.json with timing rollups', async () => { writeJsonlIndex(tmpDir, [ { testId: 'a', target: 'x', score: 0.1, executionStatus: 'execution_error' }, { testId: 'a', target: 'x', score: 0.9, executionStatus: 'ok' }, @@ -212,12 +217,31 @@ describe('aggregateRunDir', () => { expect(result.targetCount).toBe(1); const summary = JSON.parse(readFileSync(result.summaryPath, 'utf8')); + expect(summary.manifest_path).toBe(RESULT_INDEX_FILENAME); expect(summary.metadata.tests_run).toContain('a'); expect(summary.metadata.tests_run).toContain('b'); expect(summary.run_summary.x).toBeDefined(); expect(summary.timing.total_tokens).toBeGreaterThanOrEqual(0); }); + it('falls back to legacy index.jsonl bundles', async () => { + writeJsonlIndex( + tmpDir, + [ + { testId: 'legacy-a', target: 'x', score: 0.9, executionStatus: 'ok' }, + { testId: 'legacy-b', target: 'x', score: 0.8, executionStatus: 'ok' }, + ], + 'index.jsonl', + ); + + const result = await aggregateRunDir(tmpDir); + expect(result.testCount).toBe(2); + + const summary = JSON.parse(readFileSync(result.summaryPath, 'utf8')); + expect(summary.manifest_path).toBe(RESULT_INDEX_FILENAME); + expect(summary.metadata.tests_run).toEqual(['legacy-a', 'legacy-b']); + }); + it('uses last entry for duplicates in benchmark stats', async () => { writeJsonlIndex(tmpDir, [ { testId: 'a', target: 'x', score: 0.0, executionStatus: 'execution_error' }, diff --git a/apps/cli/test/commands/eval/artifact-writer.test.ts b/apps/cli/test/commands/eval/artifact-writer.test.ts index 2ba014d14..34076c949 100644 --- a/apps/cli/test/commands/eval/artifact-writer.test.ts +++ 
b/apps/cli/test/commands/eval/artifact-writer.test.ts @@ -1,4 +1,5 @@ import { afterEach, beforeEach, describe, expect, it } from 'bun:test'; +import { existsSync } from 'node:fs'; import { mkdir, readFile, readdir, rm, writeFile } from 'node:fs/promises'; import path from 'node:path'; @@ -19,6 +20,7 @@ import { type AggregateGradingArtifact, type GradingArtifact, type IndexArtifactEntry, + RESULT_INDEX_FILENAME, type RunSummaryArtifact, type TimingArtifact, buildAggregateGradingArtifact, @@ -894,7 +896,7 @@ describe('writeArtifactsFromResults', () => { await rm(testDir, { recursive: true, force: true }).catch(() => undefined); }); - it('writes summary, index, and per-run artifact files', async () => { + it('writes summary, run manifest, and per-run artifact files', async () => { const results = [ makeResult({ testId: 'alpha', score: 0.9, durationMs: 5000 }), makeResult({ testId: 'beta', score: 0.6, durationMs: 8000 }), @@ -904,6 +906,8 @@ describe('writeArtifactsFromResults', () => { evalFile: 'my-eval.yaml', }); + expect(path.basename(paths.indexPath)).toBe('run_manifest.jsonl'); + expect(existsSync(path.join(testDir, 'index.jsonl'))).toBe(false); const indexLines = await readIndexLines(paths.indexPath); expect(indexLines).toHaveLength(2); const alphaRowDir = expectRowDir(indexLines[0], 'alpha'); @@ -915,10 +919,13 @@ describe('writeArtifactsFromResults', () => { expect(artifactEntries.sort()).toEqual([ alphaRowDir, betaRowDir, - 'index.jsonl', + RESULT_INDEX_FILENAME, 'summary.json', ]); + const rootSummary: RunSummaryArtifact = JSON.parse(await readFile(paths.summaryPath, 'utf8')); + expect(rootSummary.manifest_path).toBe(RESULT_INDEX_FILENAME); + const alphaEntries = await readdir(path.join(paths.testArtifactDir, alphaRowDir)); expect(alphaEntries.sort()).toEqual(['run-1', 'summary.json']); @@ -1159,9 +1166,10 @@ describe('writeArtifactsFromResults', () => { const paths = await writeArtifactsFromResults([], testDir); const artifactEntries = await 
readdir(paths.testArtifactDir); - expect(artifactEntries.sort()).toEqual(['index.jsonl', 'summary.json']); + expect(artifactEntries.sort()).toEqual([RESULT_INDEX_FILENAME, 'summary.json']); const summary: RunSummaryArtifact = JSON.parse(await readFile(paths.summaryPath, 'utf8')); + expect(summary.manifest_path).toBe(RESULT_INDEX_FILENAME); expect(summary.notes).toContain('No results to summarize'); expect(summary.timing.total_tokens).toBe(0); expect(await readFile(paths.indexPath, 'utf8')).toBe(''); @@ -2240,9 +2248,10 @@ describe('writeArtifacts (from JSONL file)', () => { const artifactEntries = await readdir(paths.testArtifactDir); const [indexLine] = await readIndexLines(paths.indexPath); expect(artifactEntries).toContain(expectRowDir(indexLine, 'from-file')); - expect(artifactEntries).toContain('index.jsonl'); + expect(artifactEntries).toContain(RESULT_INDEX_FILENAME); const summary: RunSummaryArtifact = JSON.parse(await readFile(paths.summaryPath, 'utf8')); + expect(summary.manifest_path).toBe(RESULT_INDEX_FILENAME); expect(summary.timing.duration_ms).toBe(12000); expect(summary.timing.total_tokens).toBe(700); }); diff --git a/apps/cli/test/commands/eval/bundle.test.ts b/apps/cli/test/commands/eval/bundle.test.ts index 2e6f3e8fa..98e87390a 100644 --- a/apps/cli/test/commands/eval/bundle.test.ts +++ b/apps/cli/test/commands/eval/bundle.test.ts @@ -166,7 +166,7 @@ tests: ../data/cases.yaml expect(run.exitCode).toBe(0); expect(run.stdout).toContain('RESULT: PASS'); - await expectFileExists(path.join(bundleDir, 'run', 'inherited', 'index.jsonl')); + await expectFileExists(path.join(bundleDir, 'run', 'inherited', 'run_manifest.jsonl')); }, 60_000); it('reports unbundleable workspace references with their eval location', async () => { diff --git a/apps/cli/test/commands/eval/pipeline/bench.test.ts b/apps/cli/test/commands/eval/pipeline/bench.test.ts index 35ebec80e..3936f40ae 100644 --- a/apps/cli/test/commands/eval/pipeline/bench.test.ts +++ 
b/apps/cli/test/commands/eval/pipeline/bench.test.ts @@ -59,7 +59,7 @@ describe('pipeline bench', () => { await rm(OUT_DIR, { recursive: true, force: true }); }); - it('writes grading, index, and benchmark artifacts', async () => { + it('writes grading, run manifest, and benchmark artifacts', async () => { await writeFile( join(OUT_DIR, 'test-01', 'llm_grader_results', 'relevance.json'), JSON.stringify({ @@ -76,7 +76,7 @@ describe('pipeline bench', () => { expect(grading.assertions.length).toBeGreaterThan(0); expect(grading.graders).toHaveLength(2); - const indexContent = await readFile(join(OUT_DIR, 'index.jsonl'), 'utf8'); + const indexContent = await readFile(join(OUT_DIR, 'run_manifest.jsonl'), 'utf8'); const lines = indexContent .trim() .split('\n') @@ -90,7 +90,7 @@ describe('pipeline bench', () => { expect(benchmark.run_summary['test-target']).toBeDefined(); }, 30_000); - it('propagates experiment from manifest to index.jsonl and summary.json', async () => { + it('propagates experiment from manifest to run_manifest.jsonl and summary.json', async () => { // Overwrite manifest with experiment field await writeFile( join(OUT_DIR, 'manifest.json'), @@ -106,7 +106,7 @@ describe('pipeline bench', () => { const { execa } = await import('execa'); await execa('bun', [CLI_ENTRY, 'pipeline', 'bench', OUT_DIR]); - const indexContent = await readFile(join(OUT_DIR, 'index.jsonl'), 'utf8'); + const indexContent = await readFile(join(OUT_DIR, 'run_manifest.jsonl'), 'utf8'); const entry = JSON.parse(indexContent.trim().split('\n')[0]); expect(entry.experiment).toBe('without_skills'); @@ -118,7 +118,7 @@ describe('pipeline bench', () => { const { execa } = await import('execa'); await execa('bun', [CLI_ENTRY, 'pipeline', 'bench', OUT_DIR]); - const indexContent = await readFile(join(OUT_DIR, 'index.jsonl'), 'utf8'); + const indexContent = await readFile(join(OUT_DIR, 'run_manifest.jsonl'), 'utf8'); const entry = JSON.parse(indexContent.trim().split('\n')[0]); 
expect(entry.experiment).toBeUndefined(); diff --git a/apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts b/apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts index fef9a62cb..36f8dd7ee 100644 --- a/apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts +++ b/apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts @@ -61,7 +61,7 @@ describe('eval pipeline e2e', () => { expect(grading.graders).toHaveLength(2); expect(grading.summary.pass_rate).toBeGreaterThan(0); - const indexContent = await readFile(join(outDir, 'index.jsonl'), 'utf8'); + const indexContent = await readFile(join(outDir, 'run_manifest.jsonl'), 'utf8'); const indexLines = indexContent .trim() .split('\n') diff --git a/apps/cli/test/commands/eval/result-layout.test.ts b/apps/cli/test/commands/eval/result-layout.test.ts index 79dfd805d..3844b41b5 100644 --- a/apps/cli/test/commands/eval/result-layout.test.ts +++ b/apps/cli/test/commands/eval/result-layout.test.ts @@ -1,11 +1,16 @@ import { describe, expect, it } from 'bun:test'; +import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs'; +import { tmpdir } from 'node:os'; import path from 'node:path'; import { + RESULT_INDEX_FILENAME, buildDefaultRunDir, buildDefaultRunDirFromName, + discoverRunManifestPaths, normalizeExperimentName, relativeRunPathFromCwd, + resolveExistingRunPrimaryPath, } from '../../../src/commands/eval/result-layout.js'; describe('result layout', () => { @@ -41,4 +46,49 @@ describe('result layout', () => { ), ).toBe('default/2026-run'); }); + + it('prefers the summary manifest_path when both manifest filenames exist', () => { + const tempDir = mkdtempSync(path.join(tmpdir(), 'agentv-layout-test-')); + try { + writeFileSync(path.join(tempDir, RESULT_INDEX_FILENAME), '{"test_id":"new"}\n'); + writeFileSync(path.join(tempDir, 'index.jsonl'), '{"test_id":"legacy"}\n'); + writeFileSync( + path.join(tempDir, 'summary.json'), + `${JSON.stringify({ manifest_path: RESULT_INDEX_FILENAME })}\n`, + ); + + 
expect(resolveExistingRunPrimaryPath(tempDir)).toBe( + path.join(tempDir, RESULT_INDEX_FILENAME), + ); + } finally { + rmSync(tempDir, { recursive: true, force: true }); + } + }); + + it('falls back to legacy index.jsonl when no canonical manifest exists', () => { + const tempDir = mkdtempSync(path.join(tmpdir(), 'agentv-layout-test-')); + try { + writeFileSync(path.join(tempDir, 'index.jsonl'), '{"test_id":"legacy"}\n'); + + expect(resolveExistingRunPrimaryPath(tempDir)).toBe(path.join(tempDir, 'index.jsonl')); + } finally { + rmSync(tempDir, { recursive: true, force: true }); + } + }); + + it('discovers one manifest per nested bundle when both filenames exist', () => { + const tempDir = mkdtempSync(path.join(tmpdir(), 'agentv-layout-test-')); + try { + const bundleDir = path.join(tempDir, 'default', '2026-run', 'target-a'); + mkdirSync(bundleDir, { recursive: true }); + writeFileSync(path.join(bundleDir, RESULT_INDEX_FILENAME), '{"test_id":"new"}\n'); + writeFileSync(path.join(bundleDir, 'index.jsonl'), '{"test_id":"legacy"}\n'); + + expect(discoverRunManifestPaths(tempDir)).toEqual([ + path.join(bundleDir, RESULT_INDEX_FILENAME), + ]); + } finally { + rmSync(tempDir, { recursive: true, force: true }); + } + }); }); diff --git a/apps/cli/test/commands/eval/run-cache.test.ts b/apps/cli/test/commands/eval/run-cache.test.ts index e8a327cbb..f746f522c 100644 --- a/apps/cli/test/commands/eval/run-cache.test.ts +++ b/apps/cli/test/commands/eval/run-cache.test.ts @@ -4,13 +4,13 @@ import path from 'node:path'; import { type RunCache, resolveRunCacheFile } from '../../../src/commands/eval/run-cache.js'; describe('resolveRunCacheFile', () => { - it('should resolve new directory-based cache to index.jsonl inside dir', () => { + it('should resolve new directory-based cache to run_manifest.jsonl inside dir', () => { const cache: RunCache = { lastRunDir: '/results/default/2026-03-24T00-00-00-000Z', timestamp: '', }; expect(resolveRunCacheFile(cache)).toBe( - 
path.join('/results/default/2026-03-24T00-00-00-000Z', 'index.jsonl'), + path.join('/results/default/2026-03-24T00-00-00-000Z', 'run_manifest.jsonl'), ); }); @@ -29,7 +29,7 @@ describe('resolveRunCacheFile', () => { timestamp: '', }; expect(resolveRunCacheFile(cache)).toBe( - path.join('/results/default/2026-03-24T00-00-00-000Z', 'index.jsonl'), + path.join('/results/default/2026-03-24T00-00-00-000Z', 'run_manifest.jsonl'), ); }); diff --git a/apps/cli/test/commands/grade/grade-prepared.test.ts b/apps/cli/test/commands/grade/grade-prepared.test.ts index f4517adf6..8ece8973b 100644 --- a/apps/cli/test/commands/grade/grade-prepared.test.ts +++ b/apps/cli/test/commands/grade/grade-prepared.test.ts @@ -169,7 +169,7 @@ describe('agentv grade prepared attempts', () => { workspace_path: path.join(preparedDir, 'workspace'), manifest_path: path.join(preparedDir, 'agentv_prepare.json'), output_dir: runDir, - index_path: path.join(runDir, 'index.jsonl'), + index_path: path.join(runDir, 'run_manifest.jsonl'), }); expect(await exists(targetMarker)).toBe(false); @@ -177,7 +177,9 @@ describe('agentv grade prepared attempts', () => { expect(graderPayload.workspace_path).toBe(path.join(preparedDir, 'workspace')); expect(graderPayload.file_changes).toContain('+manual edit'); - const row = JSON.parse((await readFile(path.join(runDir, 'index.jsonl'), 'utf8')).trim()); + const row = JSON.parse( + (await readFile(path.join(runDir, 'run_manifest.jsonl'), 'utf8')).trim(), + ); expect(row).toMatchObject({ test_id: 'case-1', target: 'codex', @@ -269,7 +271,9 @@ describe('agentv grade prepared attempts', () => { ); expect(await exists(targetMarker)).toBe(false); - const row = JSON.parse((await readFile(path.join(runDir, 'index.jsonl'), 'utf8')).trim()); + const row = JSON.parse( + (await readFile(path.join(runDir, 'run_manifest.jsonl'), 'utf8')).trim(), + ); expect(row.score).toBe(0); expect(row.scores[0]).toMatchObject({ name: 'expected-tool-sequence', @@ -376,7 +380,9 @@ describe('agentv 
grade prepared attempts', () => { }); expect(await exists(targetMarker)).toBe(false); - const row = JSON.parse((await readFile(path.join(runDir, 'index.jsonl'), 'utf8')).trim()); + const row = JSON.parse( + (await readFile(path.join(runDir, 'run_manifest.jsonl'), 'utf8')).trim(), + ); const answerPath = row.answer_path ?? row.response_path ?? row.output_path; expect(typeof answerPath).toBe('string'); expect((await readFile(path.join(runDir, answerPath), 'utf8')).trim()).toBe('done'); diff --git a/apps/cli/test/commands/results/combine.test.ts b/apps/cli/test/commands/results/combine.test.ts index 48b21dbc0..341bf4725 100644 --- a/apps/cli/test/commands/results/combine.test.ts +++ b/apps/cli/test/commands/results/combine.test.ts @@ -47,7 +47,16 @@ describe('results combine', () => { function seedRun(name: string, records: object[], experiment = 'default'): string { const runDir = path.join(tempDir, '.agentv', 'results', experiment, name); mkdirSync(path.join(runDir, 'demo', 'test-a'), { recursive: true }); - writeFileSync(path.join(runDir, 'index.jsonl'), toJsonl(...records), 'utf8'); + writeFileSync( + path.join(runDir, 'run_manifest.jsonl'), + toJsonl(...records.map((record) => ({ ...record, experiment }))), + 'utf8', + ); + writeFileSync( + path.join(runDir, 'summary.json'), + `${JSON.stringify({ manifest_path: 'run_manifest.jsonl' })}\n`, + 'utf8', + ); writeFileSync(path.join(runDir, 'demo', 'test-a', 'grading.json'), '{"assertions":[]}\n'); writeFileSync( path.join(runDir, 'demo', 'test-a', 'timing.json'), diff --git a/apps/cli/test/commands/results/export-e2e-providers.test.ts b/apps/cli/test/commands/results/export-e2e-providers.test.ts index 4b2b46d7d..991d29a91 100644 --- a/apps/cli/test/commands/results/export-e2e-providers.test.ts +++ b/apps/cli/test/commands/results/export-e2e-providers.test.ts @@ -16,6 +16,7 @@ import type { RunSummaryArtifact, TimingArtifact, } from '../../../src/commands/eval/artifact-writer.js'; +import { RESULT_INDEX_FILENAME } 
from '../../../src/commands/eval/artifact-writer.js'; import { exportResults } from '../../../src/commands/results/export.js'; // ── Provider-specific JSONL records (snake_case, matching on-disk format) ── @@ -212,7 +213,7 @@ function toJsonl(...records: object[]): string { } function readIndex(outputDir: string): IndexArtifactEntry[] { - return readFileSync(path.join(outputDir, 'index.jsonl'), 'utf8') + return readFileSync(path.join(outputDir, RESULT_INDEX_FILENAME), 'utf8') .trim() .split('\n') .filter(Boolean) diff --git a/apps/cli/test/commands/results/export.test.ts b/apps/cli/test/commands/results/export.test.ts index a9ad11237..04ff971b1 100644 --- a/apps/cli/test/commands/results/export.test.ts +++ b/apps/cli/test/commands/results/export.test.ts @@ -10,6 +10,7 @@ import type { TimingArtifact, } from '../../../src/commands/eval/artifact-writer.js'; import { parseJsonlResults } from '../../../src/commands/eval/artifact-writer.js'; +import { RESULT_INDEX_FILENAME } from '../../../src/commands/eval/result-layout.js'; import { buildProjectionBundleFromExportedIndex, deriveExportRunId, @@ -164,7 +165,7 @@ function toJsonl(...records: object[]): string { } function readIndex(outputDir: string): IndexArtifactEntry[] { - return readFileSync(path.join(outputDir, 'index.jsonl'), 'utf8') + return readFileSync(path.join(outputDir, RESULT_INDEX_FILENAME), 'utf8') .trim() .split('\n') .filter(Boolean) @@ -217,10 +218,10 @@ describe('results export', () => { rmSync(tempDir, { recursive: true, force: true }); }); - it('loadExportSource resolves run workspaces to index.jsonl', async () => { + it('loadExportSource resolves run workspaces to run_manifest.jsonl', async () => { const runDir = path.join(tempDir, '2026-03-18T10-00-00-000Z'); mkdirSync(runDir, { recursive: true }); - const sourceFile = path.join(runDir, 'index.jsonl'); + const sourceFile = path.join(runDir, RESULT_INDEX_FILENAME); writeFileSync(sourceFile, toJsonl(RESULT_FULL)); const { sourceFile: loadedSource, 
results } = await loadExportSource(runDir, tempDir); @@ -249,7 +250,7 @@ describe('results export', () => { 'results', 'with-skills', '2026-03-18T10-00-00-000Z', - 'index.jsonl', + RESULT_INDEX_FILENAME, ), ); expect(outputDir).toBe( @@ -259,7 +260,7 @@ describe('results export', () => { it('deriveOutputDir rejects non-manifest paths', () => { expect(() => deriveOutputDir(tempDir, path.join(tempDir, 'results.jsonl'))).toThrow( - 'Expected a run manifest named index.jsonl', + 'Expected a run manifest named run_manifest.jsonl', ); }); @@ -412,6 +413,7 @@ describe('results export', () => { expect(existsSync(summaryPath)).toBe(true); const benchmark: RunSummaryArtifact = JSON.parse(readFileSync(summaryPath, 'utf8')); + expect(benchmark.manifest_path).toBe(RESULT_INDEX_FILENAME); expect(benchmark.metadata.eval_file).toBe('eval_2026-03-18.jsonl'); expect(benchmark.metadata.timestamp).toBe('2026-03-18T10:00:01.000Z'); // artifact-writer uses string[] for tests_run, not a count @@ -424,7 +426,7 @@ describe('results export', () => { expect(benchmark.run_summary['gpt-4o'].pass_rate).toHaveProperty('stddev'); }); - it('should create index.jsonl with per-test artifact pointers', async () => { + it('should create run_manifest.jsonl with per-test artifact pointers', async () => { const outputDir = path.join(tempDir, 'output'); const resultWithInput = { ...RESULT_FULL, @@ -435,7 +437,7 @@ describe('results export', () => { await exportResults('test.jsonl', content, outputDir); - const indexPath = path.join(outputDir, 'index.jsonl'); + const indexPath = path.join(outputDir, RESULT_INDEX_FILENAME); expect(existsSync(indexPath)).toBe(true); const entries = readFileSync(indexPath, 'utf8') @@ -691,7 +693,7 @@ describe('results export', () => { await exportResults('test.jsonl', content, outputDir); expect(existsSync(path.join(outputDir, 'summary.json'))).toBe(true); - expect(existsSync(path.join(outputDir, 'index.jsonl'))).toBe(true); + expect(existsSync(path.join(outputDir, 
RESULT_INDEX_FILENAME))).toBe(true); expect(existsSync(path.join(outputDir, 'timing.json'))).toBe(false); expect(existsSync(path.join(runArtifactDir(outputDir, RESULT_FULL), 'grading.json'))).toBe( true, diff --git a/apps/cli/test/commands/results/remote-auto-export.test.ts b/apps/cli/test/commands/results/remote-auto-export.test.ts index 4e535172a..880800334 100644 --- a/apps/cli/test/commands/results/remote-auto-export.test.ts +++ b/apps/cli/test/commands/results/remote-auto-export.test.ts @@ -63,12 +63,16 @@ function writeRunArtifacts(projectDir: string): string { const runDir = path.join(projectDir, '.agentv', 'results', 'default', 'run-001'); mkdirSync(runDir, { recursive: true }); writeFileSync( - path.join(runDir, 'index.jsonl'), + path.join(runDir, 'run_manifest.jsonl'), `${JSON.stringify({ test_id: 'alpha', score: 1 })}\n`, ); writeFileSync( path.join(runDir, 'summary.json'), - `${JSON.stringify({ eval_file: 'evals/example.eval.yaml', tests_run: 1 }, null, 2)}\n`, + `${JSON.stringify( + { manifest_path: 'run_manifest.jsonl', eval_file: 'evals/example.eval.yaml', tests_run: 1 }, + null, + 2, + )}\n`, ); return runDir; } @@ -87,7 +91,7 @@ function writeRunArtifactsWithPointers(projectDir: string): string { writeFileSync(path.join(artifactDir, 'transcript.jsonl'), transcriptContent); const transcriptSha = sha256Hex(transcriptContent); writeFileSync( - path.join(runDir, 'index.jsonl'), + path.join(runDir, 'run_manifest.jsonl'), `${JSON.stringify({ test_id: 'alpha', score: 1, @@ -108,7 +112,11 @@ function writeRunArtifactsWithPointers(projectDir: string): string { ); writeFileSync( path.join(runDir, 'summary.json'), - `${JSON.stringify({ eval_file: 'evals/example.eval.yaml', tests_run: 1 }, null, 2)}\n`, + `${JSON.stringify( + { manifest_path: 'run_manifest.jsonl', eval_file: 'evals/example.eval.yaml', tests_run: 1 }, + null, + 2, + )}\n`, ); return runDir; } @@ -199,7 +207,7 @@ describe('maybeAutoExportRunArtifacts', () => { expect(status).toBe('published'); 
expect(git(`git --git-dir "${remoteDir}" ls-tree -r --name-only main`, rootDir)).toContain( - 'runs/default/run-001/index.jsonl', + 'runs/default/run-001/run_manifest.jsonl', ); }, 20_000); @@ -221,13 +229,13 @@ describe('maybeAutoExportRunArtifacts', () => { `git --git-dir "${remoteDir}" ls-tree -r --name-only ${resultsBranch}`, rootDir, ); - expect(resultTree).toContain('runs/default/run-002/index.jsonl'); + expect(resultTree).toContain('runs/default/run-002/run_manifest.jsonl'); expect(resultTree).toContain('runs/default/run-002/summary.json'); expect(resultTree).not.toContain('runs/default/run-002/alpha/trace.json'); expect(resultTree).not.toContain('runs/default/run-002/alpha/transcript.jsonl'); const index = JSON.parse( git( - `git --git-dir "${remoteDir}" show ${resultsBranch}:runs/default/run-002/index.jsonl`, + `git --git-dir "${remoteDir}" show ${resultsBranch}:runs/default/run-002/run_manifest.jsonl`, rootDir, ), ); @@ -315,10 +323,10 @@ describe('maybeAutoExportRunArtifacts', () => { expect(status).toBe('published'); expect(git(`git --git-dir "${remoteDir}" ls-tree -r --name-only main`, rootDir)).not.toContain( - 'runs/default/run-001/index.jsonl', + 'runs/default/run-001/run_manifest.jsonl', ); expect(git('git ls-tree -r --name-only main', cloneDir)).toContain( - 'runs/default/run-001/index.jsonl', + 'runs/default/run-001/run_manifest.jsonl', ); }); }); diff --git a/apps/cli/test/commands/results/report.test.ts b/apps/cli/test/commands/results/report.test.ts index 75f82a924..e7c2e0f5d 100644 --- a/apps/cli/test/commands/results/report.test.ts +++ b/apps/cli/test/commands/results/report.test.ts @@ -6,7 +6,10 @@ import vm from 'node:vm'; import { type EvaluationResult, type GraderResult, buildTraceFromMessages } from '@agentv/core'; -import { writeArtifactsFromResults } from '../../../src/commands/eval/artifact-writer.js'; +import { + RESULT_INDEX_FILENAME, + writeArtifactsFromResults, +} from '../../../src/commands/eval/artifact-writer.js'; import { 
deriveReportPath, loadReportSource, @@ -115,7 +118,7 @@ describe('results report', () => { { evalFile: 'evals/demo.eval.yaml' }, ); - const indexPath = path.join(runDir, 'index.jsonl'); + const indexPath = path.join(runDir, RESULT_INDEX_FILENAME); const lines = readFileSync(indexPath, 'utf8') .trim() .split('\n') diff --git a/apps/cli/test/commands/results/serve.test.ts b/apps/cli/test/commands/results/serve.test.ts index 3e0295fcd..6fb87629a 100644 --- a/apps/cli/test/commands/results/serve.test.ts +++ b/apps/cli/test/commands/results/serve.test.ts @@ -382,7 +382,7 @@ function writeWtgDogfoodNoncanonicalArtifact(baseDir: string): { } { const runDir = path.join(baseDir, 'wtg-dogfood-noncanonical-run'); mkdirSync(runDir, { recursive: true }); - const indexPath = path.join(runDir, 'index.jsonl'); + const indexPath = path.join(runDir, 'run_manifest.jsonl'); writeFileSync(indexPath, toJsonl({ ...RESULT_A, test_id: 'wtg-dogfood-noncanonical' })); return { runDir, indexPath }; } @@ -430,7 +430,7 @@ describe('resolveSourceFile', () => { rmSync(tempDir, { recursive: true, force: true }); }); - it('rejects direct WTG dogfood index.jsonl manifests with setup guidance', async () => { + it('rejects direct WTG dogfood run manifests with setup guidance', async () => { const tempDir = mkdtempSync(path.join(tmpdir(), 'agentv-serve-source-')); const { indexPath } = writeWtgDogfoodNoncanonicalArtifact(tempDir); @@ -445,7 +445,7 @@ describe('resolveSourceFile', () => { const tempDir = mkdtempSync(path.join(tmpdir(), 'agentv-serve-source-')); const runDir = localRunDir(tempDir, 'default', '2026-06-17T00-00-00-000Z'); mkdirSync(runDir, { recursive: true }); - const indexPath = path.join(runDir, 'index.jsonl'); + const indexPath = path.join(runDir, 'run_manifest.jsonl'); writeFileSync(indexPath, toJsonl(RESULT_A)); await expect(resolveSourceFile(undefined, tempDir)).resolves.toBe(indexPath); @@ -475,7 +475,7 @@ describe('dashboard CLI source contract', () => { rmSync(tempDir, { 
recursive: true, force: true }); }); - it('fails before serving a direct WTG dogfood index.jsonl manifest', () => { + it('fails before serving a direct WTG dogfood run manifest', () => { const tempDir = mkdtempSync(path.join(tmpdir(), 'agentv-dashboard-source-cli-')); const projectDir = path.join(tempDir, 'project'); mkdirSync(projectDir, { recursive: true }); @@ -490,7 +490,7 @@ describe('dashboard CLI source contract', () => { expect(result.signal).toBeNull(); expect(result.stdout).not.toContain('Serving 1 result(s)'); expect(result.stderr).toContain('Unsupported Dashboard source'); - expect(result.stderr).toContain('agentv results report '); + expect(result.stderr).toContain('agentv results report '); rmSync(tempDir, { recursive: true, force: true }); }); @@ -2788,8 +2788,13 @@ describe('serve app', () => { ): { runId: string; runDir: string; manifestPath: string } { const runDir = localRunDir(opts?.baseDir ?? tempDir, opts?.experiment ?? 'default', name); mkdirSync(runDir, { recursive: true }); - const manifestPath = path.join(runDir, 'index.jsonl'); - writeFileSync(manifestPath, toJsonl(...records)); + const manifestPath = path.join(runDir, 'run_manifest.jsonl'); + writeFileSync( + manifestPath, + toJsonl( + ...records.map((record) => ({ ...record, experiment: opts?.experiment ?? 
'default' })), + ), + ); if (opts?.tags) { writeFileSync( path.join(runDir, 'tags.json'), @@ -2892,7 +2897,7 @@ describe('serve app', () => { expect(detailRes.status).toBe(200); await detailRes.json(); const records = readFileSync( - path.join(localRunDirFromRunId(tempDir, acceptedData.run_id), 'index.jsonl'), + path.join(localRunDirFromRunId(tempDir, acceptedData.run_id), 'run_manifest.jsonl'), 'utf8', ) .trim() @@ -4399,12 +4404,14 @@ describe('serve app', () => { headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ suite_filter: 'examples/demo.eval.yaml', - retry_errors: '.agentv/results/default/r0/index.jsonl', + retry_errors: '.agentv/results/default/r0/run_manifest.jsonl', }), }); expect(res.status).toBe(202); const data = (await res.json()) as { command: string }; - expect(data.command).toContain('--retry-errors .agentv/results/default/r0/index.jsonl'); + expect(data.command).toContain( + '--retry-errors .agentv/results/default/r0/run_manifest.jsonl', + ); }); it('rejects resume + rerun_failed combo with 400', async () => { @@ -4433,7 +4440,7 @@ describe('serve app', () => { suite_filter: 'examples/demo.eval.yaml', output: '.agentv/results/default/r1', resume: true, - retry_errors: '.agentv/results/default/r0/index.jsonl', + retry_errors: '.agentv/results/default/r0/run_manifest.jsonl', }), }); expect(res.status).toBe(400); @@ -4581,12 +4588,14 @@ describe('serve app', () => { headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ suite_filter: 'examples/demo.eval.yaml', - retry_errors: '.agentv/results/default/r0/index.jsonl', + retry_errors: '.agentv/results/default/r0/run_manifest.jsonl', }), }); expect(res.status).toBe(200); const data = (await res.json()) as { command: string }; - expect(data.command).toContain('--retry-errors .agentv/results/default/r0/index.jsonl'); + expect(data.command).toContain( + '--retry-errors .agentv/results/default/r0/run_manifest.jsonl', + ); }); it('emits --experiment for selected 
experiment requests', async () => { diff --git a/apps/cli/test/commands/results/shared.test.ts b/apps/cli/test/commands/results/shared.test.ts index 1760071e7..be71848cb 100644 --- a/apps/cli/test/commands/results/shared.test.ts +++ b/apps/cli/test/commands/results/shared.test.ts @@ -27,7 +27,7 @@ describe('results shared source resolution', () => { rmSync(tempDir, { recursive: true, force: true }); }); - it('resolves an explicit run workspace directory to index.jsonl', async () => { + it('resolves an explicit legacy run workspace directory to index.jsonl', async () => { const runDir = path.join(tempDir, '.agentv', 'results', 'default', '2026-03-25T10-00-00-000Z'); mkdirSync(runDir, { recursive: true }); writeFileSync(path.join(runDir, 'index.jsonl'), '{"test_id":"t1","score":1}\n'); @@ -67,7 +67,7 @@ describe('results shared source resolution', () => { writeFileSync(flatFile, '{"test_id":"t1","score":1}\n'); expect(() => resolveRunManifestPath(flatFile)).toThrow( - 'Expected a run workspace directory or index.jsonl manifest', + 'Expected a run workspace directory or run_manifest.jsonl manifest', ); }); diff --git a/apps/cli/test/commands/results/validate.test.ts b/apps/cli/test/commands/results/validate.test.ts index 4c68e016f..b230cca71 100644 --- a/apps/cli/test/commands/results/validate.test.ts +++ b/apps/cli/test/commands/results/validate.test.ts @@ -19,7 +19,7 @@ describe('results validate', () => { ); mkdirSync(runDir, { recursive: true }); writeFileSync( - path.join(runDir, 'index.jsonl'), + path.join(runDir, 'run_manifest.jsonl'), `${JSON.stringify({ timestamp: '2026-03-27T12:42:24.429Z', test_id: 'test-greeting', @@ -43,6 +43,7 @@ describe('results validate', () => { writeFileSync( path.join(runDir, 'summary.json'), `${JSON.stringify({ + manifest_path: 'run_manifest.jsonl', schema_version: 1, metadata: { experiment: 'with-skills', diff --git a/apps/cli/test/commands/runs/rerun.test.ts b/apps/cli/test/commands/runs/rerun.test.ts index 28016c0ba..903e0c74a 
100644 --- a/apps/cli/test/commands/runs/rerun.test.ts +++ b/apps/cli/test/commands/runs/rerun.test.ts @@ -147,6 +147,9 @@ async function readJsonLines(filePath: string): Promise { const entries = await readdir(dir, { withFileTypes: true }); + if (entries.some((entry) => entry.isFile() && entry.name === 'run_manifest.jsonl')) { + return [path.join(dir, 'run_manifest.jsonl')]; + } if (entries.some((entry) => entry.isFile() && entry.name === 'index.jsonl')) { return [path.join(dir, 'index.jsonl')]; } diff --git a/apps/cli/test/commands/trend/trend.test.ts b/apps/cli/test/commands/trend/trend.test.ts index db1a6b2bc..b96fc3387 100644 --- a/apps/cli/test/commands/trend/trend.test.ts +++ b/apps/cli/test/commands/trend/trend.test.ts @@ -37,12 +37,17 @@ async function createRunWorkspace( ): Promise<{ runDir: string; indexPath: string }> { const runDir = path.join(rootDir, '.agentv', 'results', 'default', runName); await mkdir(runDir, { recursive: true }); - const indexPath = path.join(runDir, 'index.jsonl'); + const indexPath = path.join(runDir, 'run_manifest.jsonl'); await writeFile( indexPath, `${records.map((record) => JSON.stringify(record)).join('\n')}\n`, 'utf8', ); + await writeFile( + path.join(runDir, 'summary.json'), + `${JSON.stringify({ manifest_path: 'run_manifest.jsonl' })}\n`, + 'utf8', + ); return { runDir, indexPath }; } @@ -239,6 +244,69 @@ describe('trend command', () => { ); }); + it('still accepts legacy index.jsonl run manifests explicitly', async () => { + const cwd = await createTempDir(); + cleanupDirs.push(cwd); + + const runDir = path.join(cwd, '.agentv', 'results', 'default', '2026-03-01T10-00-00-000Z'); + await mkdir(runDir, { recursive: true }); + const legacyManifest = path.join(runDir, 'index.jsonl'); + await writeFile( + legacyManifest, + `${JSON.stringify({ + test_id: 't1', + target: 'alpha', + score: 0.9, + timestamp: '2026-03-01T10:00:00.000Z', + })}\n`, + 'utf8', + ); + + expect(resolveTrendSources(cwd, 
[legacyManifest])).toEqual([legacyManifest]); + }); + + it('discovers legacy-only run workspaces with --last', async () => { + const cwd = await createTempDir(); + cleanupDirs.push(cwd); + + const firstRunDir = path.join(cwd, '.agentv', 'results', 'default', '2026-03-01T10-00-00-000Z'); + const secondRunDir = path.join( + cwd, + '.agentv', + 'results', + 'default', + '2026-03-08T10-00-00-000Z', + ); + await mkdir(firstRunDir, { recursive: true }); + await mkdir(secondRunDir, { recursive: true }); + const firstRecord = { + test_id: 't1', + score: 0.8, + timestamp: '2026-03-01T10:00:00.000Z', + }; + const secondRecord = { + test_id: 't1', + score: 0.85, + timestamp: '2026-03-08T10:00:00.000Z', + }; + await writeFile( + path.join(firstRunDir, 'index.jsonl'), + `${JSON.stringify(firstRecord)}\n`, + 'utf8', + ); + await writeFile( + path.join(secondRunDir, 'index.jsonl'), + `${JSON.stringify(secondRecord)}\n`, + 'utf8', + ); + + const sources = resolveTrendSources(cwd, [], 2); + expect(sources).toEqual([ + path.join(firstRunDir, 'index.jsonl'), + path.join(secondRunDir, 'index.jsonl'), + ]); + }); + it('discovers canonical run workspaces with --last ordering oldest to newest', async () => { const cwd = await createTempDir(); cleanupDirs.push(cwd); diff --git a/apps/cli/test/eval.integration.test.ts b/apps/cli/test/eval.integration.test.ts index 77896ddc5..2481790f9 100644 --- a/apps/cli/test/eval.integration.test.ts +++ b/apps/cli/test/eval.integration.test.ts @@ -338,7 +338,7 @@ describe('agentv eval CLI', () => { ]); expect(exitCode).toBe(0); - const indexPath = path.join(outputDir, 'file-target', 'index.jsonl'); + const indexPath = path.join(outputDir, 'file-target', 'run_manifest.jsonl'); expect(extractOutputPath(stdout)).toBe(indexPath); expect(stdout).toContain(`Artifact directory: ${outputDir}`); @@ -366,7 +366,7 @@ describe('agentv eval CLI', () => { const outputDir = path.join(fixture.suiteDir, 'configured-results'); expect(exitCode).toBe(0); - const indexPath 
= path.join(outputDir, 'file-target', 'index.jsonl'); + const indexPath = path.join(outputDir, 'file-target', 'run_manifest.jsonl'); expect(extractOutputPath(stdout)).toBe(indexPath); await expectFileExists(indexPath); await expectFileExists(path.join(outputDir, 'file-target', 'summary.json')); @@ -382,7 +382,7 @@ describe('agentv eval CLI', () => { } }, 30_000); - it('rejects removed --export and keeps --output as the canonical index location', async () => { + it('rejects removed --export and keeps --output as the canonical manifest location', async () => { const fixture = await createFixture(); try { const outputDir = path.join(fixture.baseDir, 'run'); @@ -410,7 +410,7 @@ describe('agentv eval CLI', () => { ]); expect(exitCode).toBe(1); - const indexPath = path.join(outputDir, 'file-target', 'index.jsonl'); + const indexPath = path.join(outputDir, 'file-target', 'run_manifest.jsonl'); expect(extractOutputPath(stdout)).toBe(indexPath); expect(stdout).not.toContain('Export files:'); @@ -454,14 +454,14 @@ describe('agentv eval CLI', () => { }, { args: ['--output-format', 'html'], - expected: ['--output-format was removed', 'index.jsonl'], + expected: ['--output-format was removed', 'run_manifest.jsonl'], }, { args: ['--output', 'results.xml'], expected: [ '--output expects a run directory', 'JUnit XML export from agentv eval has been removed', - '/index.jsonl', + '/run_manifest.jsonl', ], }, ] as const; diff --git a/apps/cli/test/unit/retry-errors.test.ts b/apps/cli/test/unit/retry-errors.test.ts index ce3b28dc1..0b2cd0f99 100644 --- a/apps/cli/test/unit/retry-errors.test.ts +++ b/apps/cli/test/unit/retry-errors.test.ts @@ -19,7 +19,15 @@ describe('retry-errors', () => { } }); - function createIndexFile(lines: object[]): string { + function createRunManifestFile(lines: object[]): string { + tmpDir = mkdtempSync(path.join(tmpdir(), 'retry-errors-test-')); + const filePath = path.join(tmpDir, 'run_manifest.jsonl'); + mkdirSync(tmpDir, { recursive: true }); + 
writeFileSync(filePath, lines.map((l) => JSON.stringify(l)).join('\n')); + return filePath; + } + + function createLegacyIndexFile(lines: object[]): string { tmpDir = mkdtempSync(path.join(tmpdir(), 'retry-errors-test-')); const filePath = path.join(tmpDir, 'index.jsonl'); mkdirSync(tmpDir, { recursive: true }); @@ -35,7 +43,7 @@ describe('retry-errors', () => { } it('loadErrorTestIds returns only execution_error test IDs', async () => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'ok', score: 0.9 }, { test_id: 'case-2', execution_status: 'execution_error', score: 0, error: 'timeout' }, { test_id: 'case-3', execution_status: 'quality_failure', score: 0.3 }, @@ -52,7 +60,7 @@ describe('retry-errors', () => { }); it('loadErrorTestIds deduplicates IDs', async () => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'execution_error', score: 0 }, { test_id: 'case-1', execution_status: 'execution_error', score: 0 }, ]); @@ -62,7 +70,7 @@ describe('retry-errors', () => { }); it('loadErrorTestIds returns empty array when no errors', async () => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'ok', score: 0.9 }, { test_id: 'case-2', execution_status: 'quality_failure', score: 0.5 }, ]); @@ -72,7 +80,7 @@ describe('retry-errors', () => { }); it('loadNonErrorResults returns only non-error results', async () => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'ok', score: 0.9 }, { test_id: 'case-2', execution_status: 'execution_error', score: 0 }, { test_id: 'case-3', execution_status: 'quality_failure', score: 0.5 }, @@ -84,8 +92,8 @@ describe('retry-errors', () => { expect(results[1].testId).toBe('case-3'); }); - it('supports index.jsonl manifests written by the CLI', async () => { - const 
filePath = createIndexFile([ + it('supports run_manifest.jsonl manifests written by the CLI', async () => { + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'ok', score: 0.9 }, { test_id: 'case-2', execution_status: 'execution_error', score: 0 }, { test_id: 'case-3', execution_status: 'quality_failure', score: 0.5 }, @@ -107,15 +115,15 @@ describe('retry-errors', () => { ]); await expect(loadErrorTestIds(filePath)).rejects.toThrow( - 'Expected a run workspace directory or index.jsonl manifest', + 'Expected a run workspace directory or run_manifest.jsonl manifest', ); await expect(loadNonErrorResults(filePath)).rejects.toThrow( - 'Expected a run workspace directory or index.jsonl manifest', + 'Expected a run workspace directory or run_manifest.jsonl manifest', ); }); - it('supports index.jsonl manifests', async () => { - const filePath = createIndexFile([ + it('supports legacy index.jsonl manifests', async () => { + const filePath = createLegacyIndexFile([ { test_id: 'case-1', execution_status: 'ok', @@ -137,7 +145,7 @@ describe('retry-errors', () => { }); it('loadFullyCompletedTestIds returns only non-error test IDs', async () => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'ok', score: 0.9 }, { test_id: 'case-2', execution_status: 'execution_error', score: 0, error: 'timeout' }, { test_id: 'case-3', execution_status: 'quality_failure', score: 0.3 }, @@ -154,7 +162,7 @@ describe('retry-errors', () => { }); it('loadFullyCompletedTestIds returns empty array when all are errors', async () => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'execution_error', score: 0 }, { test_id: 'case-2', execution_status: 'execution_error', score: 0 }, ]); @@ -164,7 +172,7 @@ describe('retry-errors', () => { }); it('loadFullyCompletedTestIds excludes IDs that errored on any target (matrix safety)', async 
() => { - const filePath = createIndexFile([ + const filePath = createRunManifestFile([ { test_id: 'case-1', execution_status: 'ok', score: 0.9, target: 'gpt-4' }, { test_id: 'case-1', execution_status: 'execution_error', score: 0, target: 'claude' }, { test_id: 'case-2', execution_status: 'ok', score: 0.8, target: 'gpt-4' }, @@ -187,9 +195,9 @@ describe('retry-errors', () => { expect(buildExclusionFilter(['!negated'])).toBe('!\\!negated'); }); - it('throws on malformed index.jsonl lines', async () => { + it('throws on malformed run_manifest.jsonl lines', async () => { tmpDir = mkdtempSync(path.join(tmpdir(), 'retry-errors-test-')); - const filePath = path.join(tmpDir, 'index.jsonl'); + const filePath = path.join(tmpDir, 'run_manifest.jsonl'); writeFileSync( filePath, [ diff --git a/apps/dashboard/src/components/StopRunButton.tsx b/apps/dashboard/src/components/StopRunButton.tsx index 9ca7a15e6..c6541b4d6 100644 --- a/apps/dashboard/src/components/StopRunButton.tsx +++ b/apps/dashboard/src/components/StopRunButton.tsx @@ -2,7 +2,7 @@ * StopRunButton — stop affordance on /jobs/:runId and active run detail * views that interrupts a Dashboard-launched eval. Stop is part of the * stop → resume → complete workflow, not a destructive cancel: the - * partial index.jsonl is preserved and can be resumed in one click from + * partial run_manifest.jsonl is preserved and can be resumed in one click from * the run-detail page. * * Calls POST /api/eval/run/:id/stop (or the project-scoped variant). diff --git a/apps/dashboard/src/lib/types.ts b/apps/dashboard/src/lib/types.ts index ea5084dea..033d68fb9 100644 --- a/apps/dashboard/src/lib/types.ts +++ b/apps/dashboard/src/lib/types.ts @@ -682,7 +682,7 @@ export interface RunEvalRequest { resume?: boolean; /** Re-run failed/errored tests while keeping passing results. */ rerun_failed?: boolean; - /** Path to a previous run dir or index.jsonl — re-run only execution_error cases. 
*/ + /** Path to a previous run dir or run manifest — re-run only execution_error cases. */ retry_errors?: string; /** Artifact directory for run output — required to target an existing run dir. */ output?: string; diff --git a/apps/web/src/content/docs/docs/evaluation/experiments.mdx b/apps/web/src/content/docs/docs/evaluation/experiments.mdx index 0ee1ac2b2..636b5e227 100644 --- a/apps/web/src/content/docs/docs/evaluation/experiments.mdx +++ b/apps/web/src/content/docs/docs/evaluation/experiments.mdx @@ -135,7 +135,7 @@ Suite imports are resolved as a deterministic include graph. Circular `type: suite` imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks. -Imported suite rows keep their source suite metadata in `index.jsonl`. Use each +Imported suite rows keep their source suite metadata in `run_manifest.jsonl`. Use each row's `result_dir` as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names. @@ -275,7 +275,7 @@ derives the group from the eval input: a single eval uses the eval metadata `multi-eval`. Inline `experiment.name` does not currently select the result group. -Imported source suite metadata appears in `index.jsonl` rows and manifests. -Use `index.jsonl` fields such as `eval_path`, `test_id`, `target`, and +Imported source suite metadata appears in `run_manifest.jsonl` rows and manifests. +Use `run_manifest.jsonl` fields such as `eval_path`, `test_id`, `target`, and `result_dir` for identity and artifact discovery instead of reconstructing paths from suite names or wrapper layout. 
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx index cc6e74e45..2b865dcc4 100644 --- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx +++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx @@ -11,7 +11,7 @@ sidebar: agentv eval evals/my-eval.yaml ``` -Results are written to `.agentv/results///index.jsonl`. +Results are written to `.agentv/results///run_manifest.jsonl`. AgentV picks the experiment bucket from `--experiment`, then `eval.yaml` `experiment.name`, then `default`. Each CLI invocation writes one timestamped run bundle. Each line is a JSON object with one result per test @@ -63,7 +63,7 @@ agentv eval evals/my-eval.yaml --experiment without_skills ``` The experiment label chooses the result bucket and is propagated to each entry -in `index.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval +in `run_manifest.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval file. If neither is set, AgentV writes to the `default` bucket. The eval file stays the same across experiments; what changes is the runtime condition. Dashboards can filter and compare results by experiment. @@ -97,22 +97,22 @@ are unchanged. ### Custom Output Directory -Write all artifacts (index.jsonl, summary.json, per-test grading/timing) to a specific directory: +Write all artifacts (run_manifest.jsonl, summary.json, per-test grading/timing) to a specific directory: ```bash agentv eval evals/my-eval.yaml --output ./my-results ``` `--output` is a run directory, not a file path. The canonical manifest is always -`/index.jsonl`. +`/run_manifest.jsonl`. -### Read Results from the Run Index +### Read Results from the Run Manifest -The run directory is the complete artifact boundary. Use `/index.jsonl` for scripts, CI summaries, and downstream tools: +The run directory is the complete artifact boundary. 
Use `/run_manifest.jsonl` for scripts, CI summaries, and downstream tools: ```bash agentv eval evals/my-eval.yaml --output ./my-results -cat ./my-results/index.jsonl +cat ./my-results/run_manifest.jsonl ``` ### Generated Task Bundles @@ -126,7 +126,7 @@ Typical layout: ```text my-results/ - index.jsonl + run_manifest.jsonl summary.json / summary.json @@ -145,11 +145,11 @@ my-results/ graders/ # copied grader prompt/script files when applicable ``` -The `index.jsonl` row links to these generated paths with snake_case fields such +The `run_manifest.jsonl` row links to these generated paths with snake_case fields such as `result_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path`. Treat those paths as relative to the run directory. When you need a portable artifact for audit, review, Dashboard inspection, or rerun workflows, -share the generated run directory and its `index.jsonl` manifest. Source-side +share the generated run directory and its `run_manifest.jsonl` manifest. Source-side case directories are still useful for organizing bulky prompts, fixtures, or tests while authoring an eval, but they are optional input organization rather than a separate artifact schema. @@ -175,12 +175,12 @@ manifest shape, and optional trace/session input with `--trace`. Export execution traces (tool calls, timing, spans) to files for debugging and analysis: -By default, AgentV writes a per-run workspace with `index.jsonl` as the canonical manifest for +By default, AgentV writes a per-run workspace with `run_manifest.jsonl` as the canonical manifest for result-oriented workflows. For full-fidelity span inspection, export OTLP JSON explicitly. 
```bash # Summary-level inspection from the run manifest -agentv inspect stats .agentv/results/default//index.jsonl +agentv inspect stats .agentv/results/default//run_manifest.jsonl # Full-fidelity OTLP JSON trace (importable by OTel backends like Jaeger, Grafana) agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json @@ -189,7 +189,7 @@ agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json agentv inspect show traces/eval.otlp.json --tree ``` -`index.jsonl` contains aggregate metrics such as score, latency, cost, token usage, and summary +`run_manifest.jsonl` contains aggregate metrics such as score, latency, cost, token usage, and summary trace counters. `--otel-file` writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend. @@ -349,7 +349,7 @@ AgentV ships three flags for picking up a partial run. They differ only in **whi | `--rerun-failed` | Only cases with `executionStatus === 'ok'` | Errors **and** test failures (assertion misses, threshold misses) | A grader change or model swap means you want to re-grade everything that wasn't already passing | | `--retry-errors ` | Anything that completed without an `execution_error` (same set as `--resume`) | Errors and missing cases | You want to point at an arbitrary prior run/manifest by path, instead of resuming the run dir you're currently writing to | -`--resume` and `--rerun-failed` both append to the existing `index.jsonl`. When `--output ` is given they target that directory; when omitted they default to the **last run dir for the current cwd**, recorded in `.agentv/cache.json` and updated after every eval. This matches promptfoo's `--resume [evalId]` and OpenCompass's `-r [timestamp]` "latest by default" convention. `--retry-errors` takes the prior run's path directly (a directory or an `index.jsonl`). +`--resume` and `--rerun-failed` both append to the existing `run_manifest.jsonl`. 
When `--output ` is given they target that directory; when omitted they default to the **last run dir for the current cwd**, recorded in `.agentv/cache.json` and updated after every eval. This matches promptfoo's `--resume [evalId]` and OpenCompass's `-r [timestamp]` "latest by default" convention. `--retry-errors` takes the prior run's path directly (a directory or a `run_manifest.jsonl`). ```bash # Resume the last run — no args needed; AgentV finds it from .agentv/cache.json @@ -362,7 +362,7 @@ agentv eval evals/my-eval.yaml --output .agentv/results/default/ --re agentv eval evals/my-eval.yaml --rerun-failed # Re-run only execution errors from any prior run by path -agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default//index.jsonl +agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default//run_manifest.jsonl ``` After any failing run, the CLI prints the exact `--rerun-failed` command for the run dir that just completed — copy/paste it. If the process or pod disappeared before you could access the local run directory and results auto-push was enabled, recover the partial run from [WIP checkpoints](/docs/tools/wip-checkpoints/) first, then use the same `--resume` flow. @@ -486,7 +486,7 @@ When automatic remote publishing sees pointers whose `ref` is `agentv/artifacts/v1` branch in the same results remote at `runs//` and rewrites the published pointer `key` to that backend object key. The configured results branch is the metadata/control -plane for `index.jsonl`, `summary.json`, tags, and pointers; it does not +plane for `run_manifest.jsonl`, `summary.json`, tags, and pointers; it does not duplicate canonical transcript payload bodies when those rows name `agentv/artifacts/v1`. Dashboard resolves the published pointers lazily when a transcript view requests the payload.
AgentV keeps this explicit pointer/backend diff --git a/apps/web/src/content/docs/docs/getting-started/quickstart.mdx b/apps/web/src/content/docs/docs/getting-started/quickstart.mdx index ddd7ceedb..de8769dbf 100644 --- a/apps/web/src/content/docs/docs/getting-started/quickstart.mdx +++ b/apps/web/src/content/docs/docs/getting-started/quickstart.mdx @@ -66,7 +66,7 @@ tests: agentv eval ./evals/example.yaml ``` -Results appear in `.agentv/results/default//index.jsonl` with scores, reasoning, and execution traces. +Results appear in `.agentv/results/default//run_manifest.jsonl` with scores, reasoning, and execution traces. ## Next Steps diff --git a/apps/web/src/content/docs/docs/guides/autoresearch.mdx b/apps/web/src/content/docs/docs/guides/autoresearch.mdx index 963ee7d08..cee37d799 100644 --- a/apps/web/src/content/docs/docs/guides/autoresearch.mdx +++ b/apps/web/src/content/docs/docs/guides/autoresearch.mdx @@ -81,7 +81,7 @@ Each autoresearch session creates a self-contained experiment directory: │ ├── iterations.jsonl # Per-cycle data (score, decision, mutation) │ └── trajectory.html # Live-updating Chart.js visualization ├── 2026-04-15T10-30-00/ # Cycle 1 run artifacts -│ ├── index.jsonl +│ ├── run_manifest.jsonl │ ├── grading.json │ └── timing.json ├── 2026-04-15T10-35-00/ # Cycle 2 run artifacts @@ -101,7 +101,7 @@ Review the mutation history with `git log` after the run completes. 
After each eval cycle, autoresearch runs `agentv compare` between the current candidate and the best baseline: ```bash -agentv compare /index.jsonl /index.jsonl --json +agentv compare /run_manifest.jsonl /run_manifest.jsonl --json ``` The decision rule: diff --git a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx index e6a195983..5e904087c 100644 --- a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx +++ b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx @@ -80,7 +80,7 @@ Benchmark task packs map cleanly onto AgentV fields at authoring time: Use this separation only when it makes the source eval easier to maintain. It is not a first-class artifact schema. After an eval runs, AgentV writes the portable audit surface into the generated run folder: each result can link from -`index.jsonl` to a run-local `task/` bundle containing `EVAL.yaml`, +`run_manifest.jsonl` to a run-local `task/` bundle containing `EVAL.yaml`, `targets.yaml`, and copied `files/` or `graders/` snapshots where applicable. 
Review, Dashboard files views, and rerun workflows should inspect those generated run artifacts instead of requiring authors to maintain a parallel source-side diff --git a/apps/web/src/content/docs/docs/guides/human-review.mdx b/apps/web/src/content/docs/docs/guides/human-review.mdx index 26a9ad601..24f410fcf 100644 --- a/apps/web/src/content/docs/docs/guides/human-review.mdx +++ b/apps/web/src/content/docs/docs/guides/human-review.mdx @@ -38,7 +38,7 @@ For workspace evaluations (EVAL.yaml), inspect the run manifest and generate the ```bash # View traces from a specific run -agentv inspect show results/2026-03-14T10-32-00_claude/index.jsonl +agentv inspect show results/2026-03-14T10-32-00_claude/run_manifest.jsonl # Generate the HTML report from the run workspace agentv results report results/2026-03-14T10-32-00_claude @@ -61,12 +61,12 @@ cat results/output.jsonl | jq '{id: .test_id, score: .score, verdict: .verdict}' ### Write feedback -Create a `feedback.json` file in the run workspace, alongside `index.jsonl`: +Create a `feedback.json` file in the run workspace, alongside `run_manifest.jsonl`: ``` results/ 2026-03-14T10-32-00_claude/ - index.jsonl # run manifest + run_manifest.jsonl # run manifest trace.otlp.json # optional OTLP trace export feedback.json # ← your review annotations ``` @@ -161,13 +161,13 @@ Keep feedback files alongside results to build a history of review decisions: ``` results/ 2026-03-12T09-00-00_claude/ - index.jsonl + run_manifest.jsonl feedback.json # first iteration review 2026-03-14T10-32-00_claude/ - index.jsonl + run_manifest.jsonl feedback.json # second iteration review 2026-03-15T16-00-00_claude/ - index.jsonl + run_manifest.jsonl feedback.json # third iteration review ``` diff --git a/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx b/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx index 0bea029db..00420c061 100644 --- a/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx 
+++ b/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx @@ -256,7 +256,7 @@ If you've been using the Agent Skills skill-creator workflow, AgentV reads your | `claude -p "prompt"` | `agentv eval evals.json --target claude` | Same eval, richer engine | | `grading.json` (read) | `/grading.json` (write) | Same per-test schema, AgentV writes one grading file per test case | | `summary.json` (read) | `/summary.json` (write) | AgentV writes the canonical run summary; convert it in a wrapper if another tool needs a narrower compatibility shape | -| n/a | `index.jsonl` (write) | AgentV-specific per-test manifest for filtering, retry, and replay workflows | +| n/a | `run_manifest.jsonl` (write) | AgentV-specific per-test manifest for filtering, retry, and replay workflows | | with-skill vs without-skill | `--target baseline --target candidate` | Structured comparison | | Graduate to richer evals | `agentv convert evals.json` → EVAL.yaml | Adds workspace, code graders, etc. | @@ -274,7 +274,7 @@ agentv pipeline run evals/my-eval.yaml --experiment without_skills agentv pipeline run evals/my-eval.yaml --experiment with_skills ``` -Both runs use the same eval file and produce separate run directories. The experiment label is recorded in `manifest.json` and `index.jsonl`, making it easy to filter and compare in dashboards. +Both runs use the same eval file and produce separate run directories. The experiment label is recorded in `manifest.json` and `run_manifest.jsonl`, making it easy to filter and compare in dashboards. This replaces the need for separate `--target baseline` / `--target candidate` configurations when the only difference between runs is the workspace setup (skills, config, etc.) rather than the target harness. 
diff --git a/apps/web/src/content/docs/docs/tools/compare.mdx b/apps/web/src/content/docs/docs/tools/compare.mdx index bafb85722..8e87e9312 100644 --- a/apps/web/src/content/docs/docs/tools/compare.mdx +++ b/apps/web/src/content/docs/docs/tools/compare.mdx @@ -15,9 +15,13 @@ Run two evaluations and compare them: agentv eval evals/my-eval.yaml --output .agentv/results/default/before # ... make changes to your agent ... agentv eval evals/my-eval.yaml --output .agentv/results/default/after -agentv compare .agentv/results/default/before/index.jsonl .agentv/results/default/after/index.jsonl +agentv compare .agentv/results/default/before/run_manifest.jsonl .agentv/results/default/after/run_manifest.jsonl ``` +`run_manifest.jsonl` is the canonical row-level result manifest. Existing +`index.jsonl` run manifests from older AgentV runs remain readable for +compatibility, but new runs write `run_manifest.jsonl`. + ## Options | Option | Description | @@ -28,7 +32,7 @@ agentv compare .agentv/results/default/before/index.jsonl .agentv/results/defaul ## How It Works -1. **Load Results** -- reads both `index.jsonl` manifests containing evaluation results +1. **Load Results** -- reads both `run_manifest.jsonl` manifests containing evaluation results 2. **Match by test_id** -- pairs results with matching `test_id` fields 3. **Compute Deltas** -- calculates `delta = score2 - score1` for each pair 4. 
**Compute Normalized Gain** -- calculates `g = delta / (1 - score1)` for each pair (see below) @@ -129,7 +133,7 @@ agentv eval evals/*.yaml --target gpt-4 --output .agentv/results/default/baselin agentv eval evals/*.yaml --target gpt-4o --output .agentv/results/default/candidate # Compare results -agentv compare .agentv/results/default/baseline/index.jsonl .agentv/results/default/candidate/index.jsonl +agentv compare .agentv/results/default/baseline/run_manifest.jsonl .agentv/results/default/candidate/run_manifest.jsonl ``` ### Prompt Optimization @@ -144,7 +148,7 @@ agentv eval evals/*.yaml --output .agentv/results/default/before agentv eval evals/*.yaml --output .agentv/results/default/after # Compare with strict threshold -agentv compare .agentv/results/default/before/index.jsonl .agentv/results/default/after/index.jsonl --threshold 0.05 +agentv compare .agentv/results/default/before/run_manifest.jsonl .agentv/results/default/after/run_manifest.jsonl --threshold 0.05 ``` ### CI Quality Gate @@ -153,7 +157,9 @@ Fail CI if the candidate regresses: ```bash #!/bin/bash -agentv compare baseline.jsonl candidate.jsonl +agentv compare \ + .agentv/results/default/baseline/run_manifest.jsonl \ + .agentv/results/default/candidate/run_manifest.jsonl if [ $? -eq 1 ]; then echo "Regression detected! Candidate performs worse than baseline." exit 1 diff --git a/apps/web/src/content/docs/docs/tools/dashboard.mdx b/apps/web/src/content/docs/docs/tools/dashboard.mdx index 21776a35b..b72810811 100644 --- a/apps/web/src/content/docs/docs/tools/dashboard.mdx +++ b/apps/web/src/content/docs/docs/tools/dashboard.mdx @@ -39,7 +39,7 @@ To open a different project, pass the project root with `--dir`: agentv dashboard --dir /path/to/project ``` -Dashboard does not accept a run workspace directory or `index.jsonl` manifest as a direct source. 
It reads one configured run source per project: the project's `.agentv/results/` tree, plus an external results repository or run directory configured under `results:` in YAML. The old `.agentv/results/runs/**` layout is not a Dashboard-visible layout. For one-off inspection of a copied run bundle, use `agentv results report `. +Dashboard does not accept a run workspace directory or `run_manifest.jsonl` manifest as a direct source. It reads one configured run source per project: the project's `.agentv/results/` tree, plus an external results repository or run directory configured under `results:` in YAML. The old `.agentv/results/runs/**` layout is not a Dashboard-visible layout. For one-off inspection of a copied run bundle, use `agentv results report `. ## Data boundary @@ -101,7 +101,7 @@ You can also set the same field globally in `$AGENTV_HOME/config.yaml` or `~/.ag ## Run Detail -Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from. Files and source views resolve against the generated run artifacts referenced by `index.jsonl`—including per-result task bundles when present—so Dashboard does not require authors to create a separate source-side bundle structure. +Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (`local` or `remote`) tells you where the run came from. Files and source views resolve against the generated run artifacts referenced by `run_manifest.jsonl`—including per-result task bundles when present—so Dashboard does not require authors to create a separate source-side bundle structure. In the per-test results table, click a test ID to open its checks, transcript, source, files, and feedback in a row detail panel while the table, filters, and scroll position stay in place. Use **Full page** from the panel when you want the standalone eval detail route. 
@@ -163,7 +163,7 @@ Select 2+ rows with the checkboxes and click the sticky **Compare N** action to ### Retroactive tags -Click any row's **Tags** cell to tag a run after the fact. Each run can carry multiple free-form tags (max 20, up to 60 characters each); local tags are stored in a `tags.json` sidecar next to `index.jsonl` in the timestamped result folder, so they're mutable, non-destructive, and won't touch your eval YAML or run manifest. The chip editor supports Enter/comma to commit a new tag, Backspace to remove the last chip, and **Clear all** to record an empty tag state. The sidecar includes a `tag_revision`; if a stale browser tab submits tags after the run's tags changed, Dashboard rejects the write and asks you to refresh before retrying. +Click any row's **Tags** cell to tag a run after the fact. Each run can carry multiple free-form tags (max 20, up to 60 characters each); local tags are stored in a `tags.json` sidecar next to `run_manifest.jsonl` in the timestamped result folder, so they're mutable, non-destructive, and won't touch your eval YAML or run manifest. The chip editor supports Enter/comma to commit a new tag, Backspace to remove the last chip, and **Clear all** to record an empty tag state. The sidecar includes a `tag_revision`; if a stale browser tab submits tags after the run's tags changed, Dashboard rejects the write and asks you to refresh before retrying. Remote run payloads stay immutable, but their tags are editable. Dashboard writes remote tag changes as metadata overlays under `metadata/runs/.../tags.json` in the configured results repo clone/branch. That overlay path is a remote-results implementation detail, not part of the local `.agentv/results///` layout. Remote tag overlays use the same `tag_revision` stale-write check as local tags. Until those overlays are synced, the run and project show a dirty state; **Sync Project** commits and pushes them when it is safe to do so. 
@@ -420,7 +420,7 @@ After sync, newly fetched remote runs appear in the list with a **remote** sourc - Safe uncommitted changes under the configured results repo's owned result and metadata paths, such as remote tag overlays under `metadata/runs/**`, are committed and pushed when `sync.auto_push: true`. - A local results repo that is ahead is pushed when `sync.auto_push: true` and the committed paths are all under `.agentv/results/**`. - Dirty non-results files, dirty metadata plus remote changes, unresolved conflicts, missing upstream branches, non-results commits ahead, and rejected pushes are blocked instead of reset. -- Non-fast-forward result branch pushes never force-push. AgentV runs a bounded fetch → merge → push loop that absorbs concurrent remote writes with a real merge commit using artifact-aware Git merge drivers (union for the append-only `index.jsonl`, a JSON-union driver for tag and feedback overlays), so the common append-mostly case auto-merges and pushes as a fast-forward. When Dashboard sync absorbs concurrent remote changes this way, the success feedback includes **Merged remote (auto)**. The removed `sync.push_conflict_policy: backup_and_force_push` value is rejected with migration guidance; remove the field or set it to `block`. +- Non-fast-forward result branch pushes never force-push. AgentV runs a bounded fetch → merge → push loop that absorbs concurrent remote writes with a real merge commit using artifact-aware Git merge drivers (union for the append-only `run_manifest.jsonl`, a JSON-union driver for tag and feedback overlays), so the common append-mostly case auto-merges and pushes as a fast-forward. When Dashboard sync absorbs concurrent remote changes this way, the success feedback includes **Merged remote (auto)**. The removed `sync.push_conflict_policy: backup_and_force_push` value is rejected with migration guidance; remove the field or set it to `block`. 
- When a genuine overlay conflict cannot be auto-merged, AgentV does not touch the canonical branch. It pushes the local work to a fresh timestamped `agentv/results-sync/--` branch and reports `needs_human_merge` with a `pending_merge` block (temp branch, target branch, and a GitHub compare URL when the remote is on GitHub). The toolbar shows a **Pending merge** card: open the link to merge the branch into the canonical target on GitHub (GitHub's pull request is the conflict surface — AgentV builds no merge UI), then click **I merged it — resync**. That resumes canonical sync by fast-forward-pulling the merged target. A premature click is a safe no-op — local work stays intact and the next sync re-creates a temp branch. When sync is blocked, Dashboard keeps the local clone intact and shows the `block_reason`, `dirty_paths` or `conflicted_paths`, `git_status`, and a compact `git_diff_summary` so you can resolve the results repo manually before syncing again. diff --git a/apps/web/src/content/docs/docs/tools/inspect.mdx b/apps/web/src/content/docs/docs/tools/inspect.mdx index bbac42fbd..6a8b93fbd 100644 --- a/apps/web/src/content/docs/docs/tools/inspect.mdx +++ b/apps/web/src/content/docs/docs/tools/inspect.mdx @@ -9,7 +9,7 @@ The `inspect` command provides headless trace inspection and analysis — no ser Supported sources: -- Run workspaces or `index.jsonl` manifests for summary-level fallback +- Run workspaces or `run_manifest.jsonl` manifests for summary-level fallback - Legacy simple trace JSONL files for read-only migration scenarios - OTLP JSON files written via `agentv eval --otel-file ...` @@ -94,7 +94,7 @@ agentv inspect show trace.otlp.json --format json \ | jq '[.[] | select(.cost_usd > 0.10) | {test_id, score, cost: .cost_usd}]' # Compare providers -agentv inspect stats .agentv/results/default//index.jsonl --group-by target --format json \ +agentv inspect stats .agentv/results/default//run_manifest.jsonl --group-by target --format json \ | jq '.groups[] | 
{label, score_mean: .metrics.score.mean}' ``` diff --git a/apps/web/src/content/docs/docs/tools/prepare.mdx b/apps/web/src/content/docs/docs/tools/prepare.mdx index c93470fd1..cfc034380 100644 --- a/apps/web/src/content/docs/docs/tools/prepare.mdx +++ b/apps/web/src/content/docs/docs/tools/prepare.mdx @@ -101,4 +101,4 @@ There is no `agentv watch` command. } ``` -Keep the prepared directory with the generated run directory when sharing review evidence. The `index.jsonl` row written by `grade` includes `metadata.prepared_attempt` with the manifest path, workspace path, prompt path, baseline status, and optional trace path. +Keep the prepared directory with the generated run directory when sharing review evidence. The `run_manifest.jsonl` row written by `grade` includes `metadata.prepared_attempt` with the manifest path, workspace path, prompt path, baseline status, and optional trace path. diff --git a/apps/web/src/content/docs/docs/tools/results.mdx b/apps/web/src/content/docs/docs/tools/results.mdx index f4b41c55b..b7cff2f61 100644 --- a/apps/web/src/content/docs/docs/tools/results.mdx +++ b/apps/web/src/content/docs/docs/tools/results.mdx @@ -9,7 +9,7 @@ import { Image } from 'astro:assets'; import resultsReportOverview from '../../../../assets/screenshots/results-report-overview.png'; import resultsReportDetails from '../../../../assets/screenshots/results-report-details.png'; -The `results` command family works on existing local AgentV run workspaces and `index.jsonl` manifests. Use it after an eval run to inspect failures, validate manifests, export artifact layouts, combine/delete local run workspaces, or generate a shareable HTML report. +The `results` command family works on existing local AgentV run workspaces and `run_manifest.jsonl` manifests. Use it after an eval run to inspect failures, validate manifests, export artifact layouts, combine/delete local run workspaces, or generate a shareable HTML report. 
Remote result repository exchange is intentionally not part of `agentv results`. New eval runs publish completed artifacts to a configured results repo or branch; `sync.auto_push: true` additionally pushes that branch to the remote. Manual remote status and sync are Dashboard/API workflows. See [Dashboard Remote Results](/docs/tools/dashboard/#remote-results) for configuration and sync behavior, and [WIP checkpoints](/docs/tools/wip-checkpoints/) for recovering in-progress runs before final publish. @@ -30,12 +30,12 @@ Remote result repository exchange is intentionally not part of `agentv results`. ## `results report` -The `results report` command turns an existing run workspace or `index.jsonl` manifest into a self-contained HTML report for sharing, inspection, and human review. +The `results report` command turns an existing run workspace or `run_manifest.jsonl` manifest into a self-contained HTML report for sharing, inspection, and human review. AgentV results report overview showing 11 tests across 2 eval files with pass, fail, pass rate, duration, and cost summary cards ```bash -agentv results report +agentv results report ``` Examples: @@ -45,7 +45,7 @@ Examples: agentv results report .agentv/results/default/2026-03-14T10-32-00_claude # Use an explicit output path -agentv results report .agentv/results/default/2026-03-14T10-32-00_claude/index.jsonl \ +agentv results report .agentv/results/default/2026-03-14T10-32-00_claude/run_manifest.jsonl \ --out ./reports/human-review.html ``` @@ -93,12 +93,12 @@ Use `--out docs/.html` when a repository should publish multiple runs. Lin Use `results export` when you need the artifact workspace layout itself rather than a rendered report. ```bash -agentv results export [--out ] [--duplicate-policy update] +agentv results export [--out ] [--duplicate-policy update] ``` -This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving. 
The run workspace is also where generated task bundles live: `index.jsonl` rows may point to per-result `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` entries. Keep those generated artifacts with the run when sharing or auditing results. +This is useful when a manifest needs to be materialized into a predictable artifact tree for other tooling, review, or archiving. The run workspace is also where generated task bundles live: `run_manifest.jsonl` rows may point to per-result `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` entries. Keep those generated artifacts with the run when sharing or auditing results. -Each exported trace sidecar and `index.jsonl` row includes a stable `projection_identity` derived from AgentV-owned fields: `run_id`, `suite` or `eval_path`, `test_id`, `target`, `source_target`, `attempt`, `variant`, `envelope_id`, `trace_id`, `root_span_id`, and the projection format/version. Retrying the same completed run keeps the same projection ID even when you choose a different `--out` directory, because `run_id` comes from the source run directory or source manifest name rather than the export destination. +Each exported trace sidecar and `run_manifest.jsonl` row includes a stable `projection_identity` derived from AgentV-owned fields: `run_id`, `suite` or `eval_path`, `test_id`, `target`, `source_target`, `attempt`, `variant`, `envelope_id`, `trace_id`, `root_span_id`, and the projection format/version. Retrying the same completed run keeps the same projection ID even when you choose a different `--out` directory, because `run_id` comes from the source run directory or source manifest name rather than the export destination. Duplicate policy is explicit: @@ -130,10 +130,13 @@ when they are available, while `transcript.jsonl` is the normalized conversation transcript with joined `tool_use.result` blocks. 
AgentV does not persist a public `trace.json` sidecar in run bundles; external observability systems can be linked through safe `external_trace` metadata when available. -`summary.json` remains the run-level aggregate summary, and `index.jsonl` +`summary.json` remains the run-level aggregate summary, and `run_manifest.jsonl` carries lightweight explicit paths such as `transcript_path`, `transcript_raw_path`, and `metrics_path` plus artifact pointers only when detached payload publishing needs them. +New run summaries include `manifest_path: "run_manifest.jsonl"` so tools can +discover the row manifest from `summary.json`, but row and artifact discovery +still uses `run_manifest.jsonl` as the authoritative record. Duration, token, and cost usage remains in `timing.json`, including source labels such as `provider_reported`, `token_estimated`, `aggregate`, or `unavailable`. @@ -164,7 +167,7 @@ Agent Skills eval artifacts map into AgentV like this: | Agent Skills pattern | AgentV field | Artifact location | |----------------------|--------------|-------------------| -| Authored `evals/evals.json` cases | AgentV eval cases and task bundle paths | Eval source plus optional `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` in `index.jsonl` | +| Authored `evals/evals.json` cases | AgentV eval cases and task bundle paths | Eval source plus optional `task_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` in `run_manifest.jsonl` | | Per-case answer | Generated target output artifact | `run-N/outputs/answer.md` | | Per-attempt sidecars | Normalized transcript, metrics, and raw provider evidence | `run-N/transcript.jsonl`, `run-N/transcript-raw.jsonl`, `run-N/metrics.json` | | Per-attempt `timing.json` | Duration, token totals, cost, and usage source labels | `run-N/timing.json` | @@ -179,7 +182,7 @@ Use the additive projection bundle path when an external adapter needs a backend-neutral handoff instead of AgentV's full artifact tree: 
```bash -agentv results export --projection-bundle +agentv results export --projection-bundle ``` This writes `projection_bundle.json` next to the exported artifacts. The bundle @@ -196,7 +199,7 @@ transcripts, datasets, experiments, or indexes into Phoenix. For adapter development and CI snapshots, use dry-run mode: ```bash -agentv results export --dry-run > projection_bundle.json +agentv results export --dry-run > projection_bundle.json ``` Dry-run prints deterministic JSON and does not write export artifacts. Vendor @@ -204,7 +207,7 @@ adapters should consume either this JSON directly or the local `projection_bundle.json`. Dry-run refs are marked `artifact_refs.status: "planned_export"` because the export tree has not been written. Bundles written with `--projection-bundle` are built from the emitted -export `index.jsonl` and use `artifact_refs.status: "emitted"`. +export `run_manifest.jsonl` and use `artifact_refs.status: "emitted"`. Raw prompt text, final output, and tool arguments/results are excluded by default, and raw-bearing artifact refs such as `grading_path`, `input_path`, @@ -212,7 +215,7 @@ default, and raw-bearing artifact refs such as `grading_path`, `input_path`, include raw payloads and raw-bearing refs in the bundle, opt in explicitly: ```bash -agentv results export --dry-run --include-raw-content +agentv results export --dry-run --include-raw-content ``` Keep backend-specific anonymization in the adapter layer. For example, an Opik @@ -240,7 +243,7 @@ The CLI contract is deliberately narrow: `agentv results` manages local result a Use these supported remote workflows instead: -- **Automatic publishing:** configure `projects[].results` or top-level `results`; new `agentv eval` and `agentv pipeline bench` runs publish completed artifacts after the run completes. Use `repo.remote` with `repo.path: .` and `repo.branch: agentv/results/v1` to store primary result records on a dedicated branch of the source repo. 
AgentV never adds or rewrites remotes in an existing checkout; that checkout's `origin` must already point at the repository you want to fetch and push. AgentV reserves `agentv/results/v1` for primary results and `agentv/artifacts/v1` for heavy artifact payloads. When `index.jsonl` rows point trace or transcript payloads at `agentv/artifacts/v1`, automatic publishing stores those bytes on that artifact branch in the same remote and publishes pointer keys such as `runs//`. The configured results branch remains the metadata/control plane (`index.jsonl`, `summary.json`, tags, and pointers) instead of duplicating canonical trace/transcript payload bodies. Local pre-publish run workspaces can still contain those files beside the manifest so local tools keep working. Mutable run tags are stored as `tags.json` with a `tag_revision`; there is no tag event log in the normal results layout. `results.repo.path` without `results.repo.remote` means an existing local Git checkout, distinct from `workspace.repos[].repo`, which is a portable repository identity. Set `sync.auto_push: true` to push after publish. In CI, use `agentv eval run --results-require-push` when push failures should fail that invocation after local artifacts are written. Non-fast-forward result branch pushes never force-push: AgentV auto-merges concurrent remote writes with artifact-aware Git merge drivers (a union driver for the append-only `index.jsonl`, a JSON-union driver for tag and feedback overlays) and pushes the merge as a fast-forward, and routes a genuine overlay conflict to a timestamped `agentv/results-sync/...` branch plus a GitHub compare/PR link for a human merge. The removed `sync.push_conflict_policy: backup_and_force_push` value is rejected with migration guidance; remove the field and rely on the default block-and-ask behavior. 
While an eval is still running, [WIP checkpoints](/docs/tools/wip-checkpoints/) can keep partial run output durable on `agentv/wip/...` branches when auto-push is enabled. +- **Automatic publishing:** configure `projects[].results` or top-level `results`; new `agentv eval` and `agentv pipeline bench` runs publish completed artifacts after the run completes. Use `repo.remote` with `repo.path: .` and `repo.branch: agentv/results/v1` to store primary result records on a dedicated branch of the source repo. AgentV never adds or rewrites remotes in an existing checkout; that checkout's `origin` must already point at the repository you want to fetch and push. AgentV reserves `agentv/results/v1` for primary results and `agentv/artifacts/v1` for heavy artifact payloads. When `run_manifest.jsonl` rows point trace or transcript payloads at `agentv/artifacts/v1`, automatic publishing stores those bytes on that artifact branch in the same remote and publishes pointer keys such as `runs//`. The configured results branch remains the metadata/control plane (`run_manifest.jsonl`, `summary.json`, tags, and pointers) instead of duplicating canonical trace/transcript payload bodies. Local pre-publish run workspaces can still contain those files beside the manifest so local tools keep working. Mutable run tags are stored as `tags.json` with a `tag_revision`; there is no tag event log in the normal results layout. `results.repo.path` without `results.repo.remote` means an existing local Git checkout, distinct from `workspace.repos[].repo`, which is a portable repository identity. Set `sync.auto_push: true` to push after publish. In CI, use `agentv eval run --results-require-push` when push failures should fail that invocation after local artifacts are written. 
Non-fast-forward result branch pushes never force-push: AgentV auto-merges concurrent remote writes with artifact-aware Git merge drivers (a union driver for the append-only `run_manifest.jsonl`, a JSON-union driver for tag and feedback overlays) and pushes the merge as a fast-forward, and routes a genuine overlay conflict to a timestamped `agentv/results-sync/...` branch plus a GitHub compare/PR link for a human merge. The removed `sync.push_conflict_policy: backup_and_force_push` value is rejected with migration guidance; remove the field and rely on the default block-and-ask behavior. While an eval is still running, [WIP checkpoints](/docs/tools/wip-checkpoints/) can keep partial run output durable on `agentv/wip/...` branches when auto-push is enabled. - **Manual Dashboard sync:** run `agentv dashboard`, open the project, and use **Sync Project**. - **Manual API sync:** while Dashboard is running, call `GET /api/projects/:projectId/remote/status` or `POST /api/projects/:projectId/remote/sync` for project-scoped automation. Single-project sessions also expose `GET /api/remote/status` and `POST /api/remote/sync`. - **Git escape hatch:** for advanced recovery, inspect or repair the configured `projects[].results.repo.path` clone with `git` directly, then sync again. 
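Reviewer note on the `results.mdx` changes above: the docs now say tools can discover the row manifest from `manifest_path` in `summary.json`, while `run_manifest.jsonl` stays the authoritative record. A minimal sketch of how a consumer might apply that order is shown here. `pickManifest` and `MANIFEST_FILENAMES` are hypothetical names used only for illustration and are not part of the AgentV API; the fallback to the pre-rename `index.jsonl` is an assumption for bundles written before this change.

```typescript
import * as path from 'node:path';

// Hypothetical constant for illustration: current filename first, then the
// pre-rename filename (assumed fallback for older run bundles).
const MANIFEST_FILENAMES = ['run_manifest.jsonl', 'index.jsonl'] as const;

/**
 * Picks the manifest for one run workspace.
 * `files` holds workspace-relative POSIX paths that exist in the run directory.
 * `summary` is the parsed `summary.json`, if one was readable.
 */
function pickManifest(
  files: ReadonlySet<string>,
  summary: { manifest_path?: unknown } | undefined,
): string | undefined {
  const hinted = summary?.manifest_path;
  if (typeof hinted === 'string' && hinted.trim().length > 0 && !hinted.startsWith('/')) {
    const normalized = path.posix.normalize(hinted);
    // Reject hints that would resolve outside the run workspace.
    const escapes = normalized === '..' || normalized.startsWith('../');
    if (!escapes && files.has(normalized)) {
      return normalized;
    }
  }
  // No usable hint: fall back to the well-known filenames in priority order.
  return MANIFEST_FILENAMES.find((name) => files.has(name));
}
```

Under these assumptions, a workspace that contains both filenames resolves to a single manifest, so a consumer would not count the same run twice.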
diff --git a/apps/web/src/content/docs/docs/tools/trend.mdx b/apps/web/src/content/docs/docs/tools/trend.mdx index 342f5ce67..8cd062c93 100644 --- a/apps/web/src/content/docs/docs/tools/trend.mdx +++ b/apps/web/src/content/docs/docs/tools/trend.mdx @@ -25,12 +25,12 @@ Filter to one suite and target: agentv trend --last 8 --suite code-review --target claude-sonnet ``` -Point directly at run workspaces or `index.jsonl` manifests when you need a specific historical slice or want a reproducible example: +Point directly at run workspaces or `run_manifest.jsonl` manifests when you need a specific historical slice or want a reproducible example: ```bash agentv trend \ .agentv/results/default/2026-03-01T10-00-00-000Z/ \ - .agentv/results/default/2026-03-08T10-00-00-000Z/index.jsonl \ + .agentv/results/default/2026-03-08T10-00-00-000Z/run_manifest.jsonl \ .agentv/results/default/2026-03-15T10-00-00-000Z/ ``` @@ -46,9 +46,12 @@ agentv trend --last 8 --suite code-review --target claude-sonnet \ `trend` only accepts canonical run workspaces: - `.agentv/results///` -- `.agentv/results///index.jsonl` +- `.agentv/results///run_manifest.jsonl` -Legacy flat `results.jsonl` files are rejected. The command stays on lightweight `index.jsonl` manifests and does not require per-test artifact hydration. +Legacy `index.jsonl` manifests from older AgentV runs remain readable when +passed directly or when they are the only manifest in a run workspace. Legacy +flat `results.jsonl` files are rejected. The command stays on lightweight +`run_manifest.jsonl` manifests and does not require per-test artifact hydration. ## Options @@ -65,7 +68,7 @@ Legacy flat `results.jsonl` files are rejected. The command stays on lightweight ## How It Works -1. Loads each selected `index.jsonl` manifest. +1. Loads each selected `run_manifest.jsonl` manifest. 2. Applies `suite` and `target` filters per record. 3. By default, reduces every run to the intersection of test IDs present in all selected runs. 4. 
Computes one mean score per run. @@ -112,7 +115,7 @@ Regression Gate: threshold=0.010 fail_on_degrading=true triggered=true "runs": [ { "label": "2026-03-01T10:00:00.000Z", - "path": "/repo/.agentv/results/default/2026-03-01T10-00-00-000Z/index.jsonl", + "path": "/repo/.agentv/results/default/2026-03-01T10-00-00-000Z/run_manifest.jsonl", "timestamp": "2026-03-01T10:00:00.000Z", "matched_test_count": 42, "mean_score": 0.92 diff --git a/apps/web/src/content/docs/docs/tools/wip-checkpoints.mdx b/apps/web/src/content/docs/docs/tools/wip-checkpoints.mdx index 2eb389fa9..a98a1a278 100644 --- a/apps/web/src/content/docs/docs/tools/wip-checkpoints.mdx +++ b/apps/web/src/content/docs/docs/tools/wip-checkpoints.mdx @@ -23,7 +23,7 @@ If no results repo is configured, or auto-push is disabled, `agentv eval` still | Location | Path or ref | What it contains | | --- | --- | --- | | Local project | `.agentv/results///summary.json` | A run-start stub with `metadata.planned_test_count` and the eval file path when known. This lets Dashboard recognize incomplete local runs as resumable. | -| Local project | `.agentv/results///index.jsonl` | Result rows appended as test cases finish. Rows use the normal snake_case result JSONL format. | +| Local project | `.agentv/results///run_manifest.jsonl` | Result rows appended as test cases finish. Rows use the normal snake_case result JSONL format. | | Results repo remote | `agentv/wip//` | A forced-updated branch containing the checkpointed run under `.agentv/results//`. | | Results repo storage branch | Configured `results.repo.branch`; local checkout configs default to `agentv/results/v1` | The final published run after `agentv eval` completes and the normal auto-export succeeds. 
| diff --git a/packages/core/src/evaluation/evaluate.ts b/packages/core/src/evaluation/evaluate.ts index be48b831a..2911e533b 100644 --- a/packages/core/src/evaluation/evaluate.ts +++ b/packages/core/src/evaluation/evaluate.ts @@ -205,7 +205,7 @@ export interface EvalConfig { readonly budgetUsd?: number; /** Optional run workspace directory for canonical AgentV artifacts. */ readonly outputDir?: string; - /** Optional experiment name recorded in summary.json and index.jsonl. */ + /** Optional experiment name recorded in summary.json and run_manifest.jsonl. */ readonly experiment?: string; } diff --git a/packages/core/src/evaluation/result-row-schema.ts b/packages/core/src/evaluation/result-row-schema.ts index d5b771031..b2d4bba78 100644 --- a/packages/core/src/evaluation/result-row-schema.ts +++ b/packages/core/src/evaluation/result-row-schema.ts @@ -1,7 +1,7 @@ /** * Result JSONL row schema used at the AgentV artifact boundary. * - * Canonical AgentV run manifests are `index.jsonl` files with snake_case keys + * Canonical AgentV run manifests are `run_manifest.jsonl` files with snake_case keys * and a numeric `score`. Historical rows produced from TypeScript * `EvaluationResult` objects may contain a small set of camelCase aliases. * Normalize those aliases only at this boundary; callers should work with the @@ -16,7 +16,7 @@ export class ResultRowSchemaError extends Error { } const MIGRATION_GUIDANCE = - 'Expected an AgentV result row with a numeric score. Eval-case JSONL is input data, not a results artifact. Run `agentv eval --output ` and pass the run workspace or its index.jsonl manifest.'; + 'Expected an AgentV result row with a numeric score. Eval-case JSONL is input data, not a results artifact. 
Run `agentv eval --output ` and pass the run workspace or its run_manifest.jsonl manifest.'; const RESULT_ROW_ALIASES = { answerPath: 'answer_path', diff --git a/packages/core/src/evaluation/results-repo-cache.test.ts b/packages/core/src/evaluation/results-repo-cache.test.ts index 6682d3561..b133fe23d 100644 --- a/packages/core/src/evaluation/results-repo-cache.test.ts +++ b/packages/core/src/evaluation/results-repo-cache.test.ts @@ -37,7 +37,7 @@ function writeRun( const runDir = path.join(repoDir, 'runs', experiment, timestamp); mkdirSync(runDir, { recursive: true }); writeFileSync( - path.join(runDir, 'index.jsonl'), + path.join(runDir, 'run_manifest.jsonl'), `${JSON.stringify({ timestamp, test_id: `${experiment}-case`, @@ -51,6 +51,7 @@ function writeRun( path.join(runDir, 'summary.json'), `${JSON.stringify( { + manifest_path: 'run_manifest.jsonl', metadata: { display_name: `${experiment} ${timestamp}`, experiment, diff --git a/packages/core/src/evaluation/results-repo.ts b/packages/core/src/evaluation/results-repo.ts index e86ba98db..1c274f917 100644 --- a/packages/core/src/evaluation/results-repo.ts +++ b/packages/core/src/evaluation/results-repo.ts @@ -66,7 +66,10 @@ const GIT_EMPTY_TREE = '4b825dc642cb6eb9a060e54bf8d69288fbee4904'; // never overwrites the user's git config. See createOrphanResultsBranch. const RESULTS_REPO_GENESIS_MESSAGE = 'chore(results): initialize AgentV results branch'; const RESULTS_REPO_GENESIS_DATE = '@0 +0000'; -const RESULT_INDEX_FILENAME = 'index.jsonl'; +const RESULT_MANIFEST_FILENAME = 'run_manifest.jsonl'; +const LEGACY_RESULT_INDEX_FILENAME = 'index.jsonl'; +const RESULT_INDEX_FILENAME = RESULT_MANIFEST_FILENAME; +const RESULT_MANIFEST_FILENAMES = [RESULT_MANIFEST_FILENAME, LEGACY_RESULT_INDEX_FILENAME] as const; // Artifact-aware merge config for the AgentV-owned results checkout. 
These two // pieces let `git merge` reconcile concurrent result writes automatically so @@ -80,7 +83,8 @@ const RESULT_INDEX_FILENAME = 'index.jsonl'; // never conflicts on them and they need no attribute. const RESULTS_REPO_GITATTRIBUTES_FILE = '.gitattributes'; const RESULTS_REPO_GITATTRIBUTES_CONTENT = `# Managed by AgentV. Artifact-aware merge so results sync never force-pushes. -# Append-only run index: union concurrent appends (lines are orthogonal). +# Append-only run manifests: union concurrent appends (lines are orthogonal). +run_manifest.jsonl merge=union index.jsonl merge=union # Editable run overlay (tags/feedback): 3-way JSON set/field union via the # agentv-json driver; a genuine scalar conflict falls through to a human merge. @@ -3014,9 +3018,50 @@ function isDeprecatedTraceArtifactPath(relativePath: string): boolean { return relativePath === 'trace.json' || relativePath.endsWith('/trace.json'); } +function isResultManifestFilename(filename: string): boolean { + return RESULT_MANIFEST_FILENAMES.includes(filename as (typeof RESULT_MANIFEST_FILENAMES)[number]); +} + +function safeLocalSummaryManifestPath( + sourceDir: string, + manifestPath: unknown, +): string | undefined { + if (typeof manifestPath !== 'string' || manifestPath.trim().length === 0) { + return undefined; + } + if (path.isAbsolute(manifestPath)) { + return undefined; + } + const normalized = path.normalize(manifestPath); + if (normalized.startsWith('..') || path.isAbsolute(normalized)) { + return undefined; + } + return path.join(sourceDir, normalized); +} + +function resolveLocalResultManifestPath(sourceDir: string): string | undefined { + try { + const summary = JSON.parse(readFileSync(path.join(sourceDir, 'summary.json'), 'utf8')) as { + manifest_path?: unknown; + }; + const manifestPath = safeLocalSummaryManifestPath(sourceDir, summary.manifest_path); + if (manifestPath && existsSync(manifestPath)) { + return manifestPath; + } + } catch {} + + for (const filename of 
RESULT_MANIFEST_FILENAMES) { + const manifestPath = path.join(sourceDir, filename); + if (existsSync(manifestPath)) { + return manifestPath; + } + } + return undefined; +} + function collectArtifactSidecarPointers(sourceDir: string): ArtifactSidecarPointer[] { - const indexPath = path.join(sourceDir, RESULT_INDEX_FILENAME); - if (!existsSync(indexPath)) { + const indexPath = resolveLocalResultManifestPath(sourceDir); + if (!indexPath) { return []; } @@ -3162,7 +3207,7 @@ async function preparePublishedResultsSource(params: { for (const sourceFile of sourceFiles) { const relativeFile = path.relative(params.sourceDir, sourceFile).split(path.sep).join('/'); const destinationFile = path.join(publishedRoot, ...relativeFile.split('/')); - if (relativeFile === RESULT_INDEX_FILENAME) { + if (isResultManifestFilename(relativeFile)) { const original = readFileSync(sourceFile, 'utf8'); const rewritten = original .split(/\r?\n/) @@ -3811,6 +3856,7 @@ type GitBatchBlob = { }; type GitRunSummary = { + readonly manifest_path?: string; readonly metadata?: { readonly display_name?: string; readonly timestamp?: string; @@ -3826,6 +3872,50 @@ type GitRunSummary = { >; }; +function safeGitSummaryManifestPath(runDir: string, manifestPath: unknown): string | undefined { + if (typeof manifestPath !== 'string' || manifestPath.trim().length === 0) { + return undefined; + } + if (manifestPath.startsWith('/')) { + return undefined; + } + const normalized = path.posix.normalize(manifestPath); + if (normalized === '..' 
|| normalized.startsWith('../')) { + return undefined; + } + return path.posix.join(runDir, normalized); +} + +function buildGitManifestPaths( + treePaths: readonly string[], + summaryByPath: ReadonlyMap, +): string[] { + const treePathSet = new Set(treePaths); + const manifestByRunDir = new Map(); + + for (const [summaryPath, summary] of summaryByPath) { + const runDir = path.posix.dirname(summaryPath); + const manifestPath = safeGitSummaryManifestPath(runDir, summary.manifest_path); + if (manifestPath && treePathSet.has(manifestPath)) { + manifestByRunDir.set(runDir, manifestPath); + } + } + + for (const filename of RESULT_MANIFEST_FILENAMES) { + for (const treePath of treePaths) { + if (!treePath.endsWith(`/${filename}`)) { + continue; + } + const runDir = path.posix.dirname(treePath); + if (!manifestByRunDir.has(runDir)) { + manifestByRunDir.set(runDir, treePath); + } + } + } + + return [...manifestByRunDir.values()].sort(); +} + function buildGitRunId(relativeRunPath: string): string { const normalized = relativeRunPath.split(path.sep).join('/'); const segments = normalized.split('/').filter(Boolean); @@ -4341,22 +4431,7 @@ export async function listGitRuns(repoDir: string, ref = 'origin/main'): Promise .split(/\r?\n/) .map((line) => line.trim()) .filter(Boolean); - const indexPaths = treePaths.filter((line) => line.endsWith('/index.jsonl')); - if (indexPaths.length === 0) { - return []; - } - - const batchInput = `${indexPaths.map((indexPath) => `${ref}:${indexPath}`).join('\n')}\n`; - const blobs = parseGitBatchBlobs(await runGitBatch(repoDir, batchInput)); - if (blobs.length !== indexPaths.length) { - throw new Error( - `Expected ${indexPaths.length} git blobs but received ${blobs.length} while listing results runs`, - ); - } - - const summaryPaths = indexPaths - .map((indexPath) => path.posix.join(path.posix.dirname(indexPath), 'summary.json')) - .filter((summaryPath) => treePaths.includes(summaryPath)); + const summaryPaths = treePaths.filter((line) => 
line.endsWith('/summary.json')); const summaryByPath = new Map(); if (summaryPaths.length > 0) { const summaryBatchInput = `${summaryPaths.map((summaryPath) => `${ref}:${summaryPath}`).join('\n')}\n`; @@ -4369,6 +4444,19 @@ export async function listGitRuns(repoDir: string, ref = 'origin/main'): Promise } } + const indexPaths = buildGitManifestPaths(treePaths, summaryByPath); + if (indexPaths.length === 0) { + return []; + } + + const batchInput = `${indexPaths.map((indexPath) => `${ref}:${indexPath}`).join('\n')}\n`; + const blobs = parseGitBatchBlobs(await runGitBatch(repoDir, batchInput)); + if (blobs.length !== indexPaths.length) { + throw new Error( + `Expected ${indexPaths.length} git blobs but received ${blobs.length} while listing results runs`, + ); + } + const runs = blobs.flatMap((blob, index): GitListedRun[] => { const manifestPath = indexPaths[index]; const runDir = path.posix.dirname(manifestPath); diff --git a/packages/core/src/evaluation/run-artifacts.ts b/packages/core/src/evaluation/run-artifacts.ts index 27dbd2312..07edb924f 100644 --- a/packages/core/src/evaluation/run-artifacts.ts +++ b/packages/core/src/evaluation/run-artifacts.ts @@ -2,7 +2,7 @@ * Canonical AgentV run artifact helpers. * * This module owns the shared run-workspace contract used by CLI and - * programmatic evals: `index.jsonl`, run-root `summary.json`, per-case + * programmatic evals: `run_manifest.jsonl`, run-root `summary.json`, per-case * `summary.json`, `run-N/result.json`, and transcript projections. Keep wire * keys in snake_case here so every caller produces the same artifacts. */ @@ -55,7 +55,11 @@ import type { TrialResult, } from './types.js'; -export const RESULT_INDEX_FILENAME = 'index.jsonl'; +export const RESULT_MANIFEST_FILENAME = 'run_manifest.jsonl'; +export const LEGACY_RESULT_INDEX_FILENAME = 'index.jsonl'; +// Backward-compatible export name retained for existing callers. New writes use +// the row-level run manifest filename. 
+export const RESULT_INDEX_FILENAME = RESULT_MANIFEST_FILENAME; export const RUN_SUMMARY_FILENAME = 'summary.json'; const TIMING_SOURCE_VALUES = [ @@ -163,7 +167,8 @@ export async function aggregateRunDir( runtimeSource?: RunRuntimeSourceMetadata; }, ): Promise<{ summaryPath: string; testCount: number; targetCount: number }> { - const indexPath = path.join(runDir, RESULT_INDEX_FILENAME); + const indexPath = + (await resolveExistingResultManifestPath(runDir)) ?? path.join(runDir, RESULT_INDEX_FILENAME); const content = await readFile(indexPath, 'utf8'); const allResults = parseJsonlResults(content); const results = deduplicateByTestIdTarget(allResults); @@ -187,6 +192,54 @@ export async function aggregateRunDir( return { summaryPath, testCount: results.length, targetCount: targetSet.size }; } +async function readTextIfExists(filePath: string): Promise { + return readFile(filePath, 'utf8').catch(() => undefined); +} + +function safeSummaryManifestPath(runDir: string, manifestPath: unknown): string | undefined { + if (typeof manifestPath !== 'string' || manifestPath.trim().length === 0) { + return undefined; + } + if (path.isAbsolute(manifestPath)) { + return undefined; + } + const normalized = path.normalize(manifestPath); + if (normalized.startsWith('..') || path.isAbsolute(normalized)) { + return undefined; + } + return path.join(runDir, normalized); +} + +async function readRunSummaryManifestPath(runDir: string): Promise { + const summaryText = await readTextIfExists(path.join(runDir, RUN_SUMMARY_FILENAME)); + if (!summaryText) { + return undefined; + } + try { + const parsed = JSON.parse(summaryText) as { manifest_path?: unknown }; + const manifestPath = safeSummaryManifestPath(runDir, parsed.manifest_path); + if (manifestPath && (await readTextIfExists(manifestPath)) !== undefined) { + return manifestPath; + } + } catch {} + return undefined; +} + +async function resolveExistingResultManifestPath(runDir: string): Promise { + const summaryManifestPath = await 
readRunSummaryManifestPath(runDir); + if (summaryManifestPath) { + return summaryManifestPath; + } + + for (const filename of [RESULT_MANIFEST_FILENAME, LEGACY_RESULT_INDEX_FILENAME]) { + const manifestPath = path.join(runDir, filename); + if ((await readTextIfExists(manifestPath)) !== undefined) { + return manifestPath; + } + } + return undefined; +} + async function readRunSummaryMetadata(summaryPath: string): Promise<{ plannedTestCount?: number; runtimeSource?: RunRuntimeSourceMetadata; @@ -340,6 +393,7 @@ export interface TimingArtifact { } export interface RunSummaryArtifact { + readonly manifest_path: string; readonly metadata: { readonly eval_file: string; readonly timestamp: string; @@ -1310,6 +1364,7 @@ export function buildRunSummaryArtifact( const timestamp = firstResult?.timestamp ?? new Date().toISOString(); return { + manifest_path: RESULT_MANIFEST_FILENAME, metadata: { eval_file: evalFile, timestamp, @@ -1566,7 +1621,7 @@ function buildTraceEnvelopeSidecar(params: TraceEnvelopeSidecarParams): TraceEnv runId: params.runId ?? path.basename(params.outputDir), experiment: params.experiment, variant: params.result.variant, - source: { path: RESULT_INDEX_FILENAME }, + source: { path: RESULT_MANIFEST_FILENAME }, capture: { content: 'full', redactionLevel: 'none', redactedFields: [] }, artifacts: { answer_path: params.result.output.length > 0 ? 
'outputs/answer.md' : undefined, @@ -1847,8 +1902,11 @@ function projectionIdentityRecordKey(record: unknown): string | undefined { } async function readExistingIndexRecords(outputDir: string): Promise { - const indexPath = path.join(outputDir, RESULT_INDEX_FILENAME); - const content = await readFile(indexPath, 'utf8').catch(() => undefined); + const indexPath = await resolveExistingResultManifestPath(outputDir); + if (!indexPath) { + return []; + } + const content = await readTextIfExists(indexPath); if (content === undefined) { return []; } @@ -1914,8 +1972,11 @@ async function rewriteExistingIndexRecords( return; } - const indexPath = path.join(outputDir, RESULT_INDEX_FILENAME); - const content = await readFile(indexPath, 'utf8').catch(() => undefined); + const indexPath = await resolveExistingResultManifestPath(outputDir); + if (!indexPath) { + return; + } + const content = indexPath ? await readTextIfExists(indexPath) : undefined; if (content === undefined) { return; } diff --git a/packages/core/src/index.ts b/packages/core/src/index.ts index 072bbded2..139c6c5a6 100644 --- a/packages/core/src/index.ts +++ b/packages/core/src/index.ts @@ -58,7 +58,9 @@ export { type EvalSummary, } from './evaluation/evaluate.js'; export { + LEGACY_RESULT_INDEX_FILENAME, RESULT_INDEX_FILENAME, + RESULT_MANIFEST_FILENAME, RUN_SUMMARY_FILENAME, aggregateRunDir, buildAggregateGradingArtifact, diff --git a/packages/core/test/evaluation/evaluate-programmatic-api.test.ts b/packages/core/test/evaluation/evaluate-programmatic-api.test.ts index 97be77621..fd2c4c6b5 100644 --- a/packages/core/test/evaluation/evaluate-programmatic-api.test.ts +++ b/packages/core/test/evaluation/evaluate-programmatic-api.test.ts @@ -11,6 +11,7 @@ import { readFile } from 'node:fs/promises'; import { tmpdir } from 'node:os'; import path from 'node:path'; import { evaluate } from '../../src/evaluation/evaluate.js'; +import { RESULT_INDEX_FILENAME } from '../../src/evaluation/run-artifacts.js'; const 
PROGRAMMATIC_API_TIMEOUT_MS = 15_000; @@ -131,10 +132,10 @@ describe('evaluate() — programmatic API extensions', () => { expect(result.artifacts).toBeDefined(); expect(result.artifacts?.runDir).toBe(outputDir); - expect(result.artifacts?.indexPath).toBe(path.join(outputDir, 'index.jsonl')); + expect(result.artifacts?.indexPath).toBe(path.join(outputDir, RESULT_INDEX_FILENAME)); expect(result.artifacts?.summaryPath).toBe(path.join(outputDir, 'summary.json')); - const indexContent = await readFile(path.join(outputDir, 'index.jsonl'), 'utf8'); + const indexContent = await readFile(path.join(outputDir, RESULT_INDEX_FILENAME), 'utf8'); expect(indexContent).toContain('"test_id":"programmatic-artifacts"'); expect(indexContent).toContain('"experiment":"sdk-test"'); const [indexRow] = indexContent diff --git a/packages/core/test/evaluation/orchestrator.test.ts b/packages/core/test/evaluation/orchestrator.test.ts index 9ddaa1c05..64bdc0002 100644 --- a/packages/core/test/evaluation/orchestrator.test.ts +++ b/packages/core/test/evaluation/orchestrator.test.ts @@ -30,7 +30,10 @@ import { type ReplayFixtureRecord, serializeReplayFixtureRecord, } from '../../src/evaluation/replay-fixtures.js'; -import { writeArtifactsFromResults } from '../../src/evaluation/run-artifacts.js'; +import { + RESULT_INDEX_FILENAME, + writeArtifactsFromResults, +} from '../../src/evaluation/run-artifacts.js'; import { RunBudgetTracker } from '../../src/evaluation/run-budget-tracker.js'; import { buildTraceEnvelopeFromEvaluationResult, @@ -760,7 +763,7 @@ console.log('spreadsheet: revenue,total\\nQ1,42');`, const outputDir = path.join(tempDir, 'artifacts'); await writeArtifactsFromResults([result], outputDir); - const indexRows = readFileSync(path.join(outputDir, 'index.jsonl'), 'utf8') + const indexRows = readFileSync(path.join(outputDir, RESULT_INDEX_FILENAME), 'utf8') .trim() .split('\n') .map((line) => JSON.parse(line) as Record); diff --git a/packages/core/test/evaluation/results-repo.test.ts 
b/packages/core/test/evaluation/results-repo.test.ts index 651df142d..ab9490bc1 100644 --- a/packages/core/test/evaluation/results-repo.test.ts +++ b/packages/core/test/evaluation/results-repo.test.ts @@ -239,11 +239,12 @@ function randomToken(): string { function writeRunArtifacts(runDir: string, experiment: string, timestamp: string): void { mkdirSync(runDir, { recursive: true }); - writeFileSync(path.join(runDir, 'index.jsonl'), '{"test_id":"alpha"}\n'); + writeFileSync(path.join(runDir, 'run_manifest.jsonl'), '{"test_id":"alpha"}\n'); writeFileSync( path.join(runDir, 'summary.json'), JSON.stringify( { + manifest_path: 'run_manifest.jsonl', metadata: { timestamp, experiment, @@ -294,7 +295,7 @@ function writeRunArtifactsWithPointers( const legacyTraceSha = sha256Hex(legacyTraceContent); const transcriptSha = sha256Hex(transcriptContent); writeFileSync( - path.join(runDir, 'index.jsonl'), + path.join(runDir, 'run_manifest.jsonl'), `${JSON.stringify({ test_id: 'alpha', score: 1, @@ -402,7 +403,7 @@ describe('listGitRuns', () => { rmSync(repoDir, { recursive: true, force: true }); }); - it('returns committed runs derived from index.jsonl manifests', async () => { + it('returns committed runs derived from run manifests and legacy index.jsonl manifests', async () => { const defaultRunDir = path.join(repoDir, 'runs', 'default', '2026-05-20T10-00-00-000Z'); mkdirSync(defaultRunDir, { recursive: true }); writeFileSync( @@ -443,7 +444,7 @@ describe('listGitRuns', () => { const experimentRunDir = path.join(repoDir, 'runs', 'with-skills', '2026-05-21T11-00-00-000Z'); mkdirSync(experimentRunDir, { recursive: true }); writeFileSync( - path.join(experimentRunDir, 'index.jsonl'), + path.join(experimentRunDir, 'run_manifest.jsonl'), `${[ JSON.stringify({ test_id: 'alpha', @@ -466,6 +467,7 @@ describe('listGitRuns', () => { path.join(experimentRunDir, 'summary.json'), JSON.stringify( { + manifest_path: 'run_manifest.jsonl', metadata: { display_name: 'remote friendly run', 
timestamp: '2026-05-21T11:00:00.000Z', @@ -500,7 +502,7 @@ describe('listGitRuns', () => { experiment: 'with-skills', timestamp: '2026-05-21T11:00:00.000Z', display_name: 'remote friendly run', - manifest_path: 'runs/with-skills/2026-05-21T11-00-00-000Z/index.jsonl', + manifest_path: 'runs/with-skills/2026-05-21T11-00-00-000Z/run_manifest.jsonl', summary_path: 'runs/with-skills/2026-05-21T11-00-00-000Z/summary.json', test_count: 3, pass_rate: 0.75, @@ -518,6 +520,58 @@ describe('listGitRuns', () => { expect(runs[0].size_bytes).toBeGreaterThan(0); }); + it('does not double-count a remote bundle that has both manifest filenames', async () => { + const runDir = path.join(repoDir, 'runs', 'default', '2026-05-22T12-00-00-000Z'); + mkdirSync(runDir, { recursive: true }); + const canonical = `${JSON.stringify({ + test_id: 'canonical', + target: 'codex', + score: 1, + timestamp: '2026-05-22T12:00:00.000Z', + })}\n`; + writeFileSync(path.join(runDir, 'run_manifest.jsonl'), canonical); + writeFileSync( + path.join(runDir, 'index.jsonl'), + `${JSON.stringify({ + test_id: 'legacy', + target: 'codex', + score: 0, + timestamp: '2026-05-22T12:00:00.000Z', + })}\n`, + ); + writeFileSync( + path.join(runDir, 'summary.json'), + JSON.stringify( + { + manifest_path: 'run_manifest.jsonl', + metadata: { + timestamp: '2026-05-22T12:00:00.000Z', + targets: ['codex'], + tests_run: ['canonical'], + }, + run_summary: { + codex: { + pass_rate: { mean: 1 }, + }, + }, + }, + null, + 2, + ), + ); + git('git add runs && git commit -m "seed duplicate manifests"', repoDir); + + const runs = await listGitRuns(repoDir, 'HEAD'); + + expect(runs).toHaveLength(1); + expect(runs[0]).toMatchObject({ + run_id: '2026-05-22T12-00-00-000Z', + manifest_path: 'runs/default/2026-05-22T12-00-00-000Z/run_manifest.jsonl', + test_count: 1, + avg_score: 1, + }); + }); + it('returns an empty list when the ref has no committed runs', async () => { writeFileSync(path.join(repoDir, 'README.md'), '# test\n'); git('git add 
README.md && git commit -m "initial"', repoDir); @@ -536,7 +590,7 @@ describe('listGitRuns', () => { const runDir = path.join(repoDir, 'runs', 'default', '2026-05-20T10-00-00-000Z'); mkdirSync(runDir, { recursive: true }); writeFileSync( - path.join(runDir, 'index.jsonl'), + path.join(runDir, 'run_manifest.jsonl'), `${JSON.stringify({ test_id: 'alpha', target: 'gpt-4o', @@ -547,6 +601,7 @@ describe('listGitRuns', () => { path.join(runDir, 'summary.json'), JSON.stringify( { + manifest_path: 'run_manifest.jsonl', metadata: { timestamp: '2026-05-20T10:00:00.000Z', targets: ['gpt-4o'], @@ -591,10 +646,11 @@ describe('listGitRuns', () => { it('materializes an entire run subtree atomically from git objects', async () => { const runDir = path.join(repoDir, 'runs', 'with-files', '2026-05-22T10-00-00-000Z'); mkdirSync(path.join(runDir, 'attachments'), { recursive: true }); - writeFileSync(path.join(runDir, 'index.jsonl'), '{"test_id":"alpha"}\n'); + writeFileSync(path.join(runDir, 'run_manifest.jsonl'), '{"test_id":"alpha"}\n'); writeFileSync( path.join(runDir, 'summary.json'), JSON.stringify({ + manifest_path: 'run_manifest.jsonl', metadata: { timestamp: '2026-05-22T10:00:00.000Z', experiment: 'with-files', @@ -615,7 +671,9 @@ describe('listGitRuns', () => { await materializeGitRun(repoDir, 'with-files/2026-05-22T10-00-00-000Z', 'HEAD'); - expect(readFileSync(path.join(runDir, 'index.jsonl'), 'utf8')).toContain('"test_id":"alpha"'); + expect(readFileSync(path.join(runDir, 'run_manifest.jsonl'), 'utf8')).toContain( + '"test_id":"alpha"', + ); expect(readFileSync(path.join(runDir, 'attachments', 'response.md'), 'utf8')).toBe( 'hello from git\n', ); @@ -1332,7 +1390,7 @@ describe('results repo write path', () => { expect(published).toBe(true); expect(git('git branch --show-current', resultsRepoDir)).toBe('main'); const branchFiles = git(`git ls-tree -r --name-only ${DEFAULT_RESULTS_BRANCH}`, resultsRepoDir); - 
expect(branchFiles).toContain(`runs/external/${runTimestamp}/index.jsonl`); + expect(branchFiles).toContain(`runs/external/${runTimestamp}/run_manifest.jsonl`); expect(branchFiles).not.toContain('README.md'); }, 20000); @@ -1564,7 +1622,7 @@ describe('results repo write path', () => { `AgentV-Run: with-skills::${runTimestamp}`, ); expect(git('git ls-tree -r --name-only main', cloneDir)).toContain( - `runs/with-skills/${runTimestamp}/index.jsonl`, + `runs/with-skills/${runTimestamp}/run_manifest.jsonl`, ); const runs = await listGitRuns(cloneDir, 'main'); @@ -1637,7 +1695,7 @@ describe('results repo write path', () => { `git --git-dir "${remoteDir}" ls-tree -r --name-only ${storageBranch}`, rootDir, ); - expect(resultTree).toContain(`runs/${destinationPath}/index.jsonl`); + expect(resultTree).toContain(`runs/${destinationPath}/run_manifest.jsonl`); expect(resultTree).toContain(`runs/${destinationPath}/summary.json`); expect(resultTree).not.toContain(`runs/${destinationPath}/alpha/trace.json`); expect(resultTree).not.toContain(`runs/${destinationPath}/alpha/transcript.jsonl`); @@ -1649,11 +1707,11 @@ describe('results repo write path', () => { expect(artifactTree).not.toContain(`runs/${destinationPath}/alpha/trace.json`); expect(artifactTree).toContain(`runs/${destinationPath}/alpha/transcript.jsonl`); expect(artifactTree).not.toContain(`runs/${destinationPath}/summary.json`); - expect(artifactTree).not.toContain(`runs/${destinationPath}/index.jsonl`); + expect(artifactTree).not.toContain(`runs/${destinationPath}/run_manifest.jsonl`); const index = JSON.parse( gitRaw( - `git --git-dir "${remoteDir}" show ${storageBranch}:runs/${destinationPath}/index.jsonl`, + `git --git-dir "${remoteDir}" show ${storageBranch}:runs/${destinationPath}/run_manifest.jsonl`, rootDir, ).toString('utf8'), ); @@ -1740,7 +1798,7 @@ describe('results repo write path', () => { `git --git-dir "${remoteDir}" ls-tree -r --name-only ${storageBranch}`, rootDir, ); - 
expect(resultTree).toContain(`runs/${destinationPath}/index.jsonl`); + expect(resultTree).toContain(`runs/${destinationPath}/run_manifest.jsonl`); expect(resultTree).toContain(`runs/${destinationPath}/summary.json`); expect(resultTree).not.toContain(`runs/${destinationPath}/alpha/trace.json`); expect(resultTree).not.toContain(`runs/${destinationPath}/alpha/transcript.jsonl`); @@ -2124,17 +2182,17 @@ describe('results repo write path', () => { ).toBe(''); }, 20000); - it('union-merges concurrent appends to the same run index without conflict', async () => { + it('union-merges concurrent appends to the same run manifest without conflict', async () => { const { remoteDir, seedDir } = initializeRemoteRepo(rootDir); const cloneDir = path.join(rootDir, 'results-clone'); const config = createResultsConfig(remoteDir, cloneDir); - const indexRel = path.join('runs', 'shared', '2026-05-25T12-00-00-000Z', 'index.jsonl'); + const indexRel = path.join('runs', 'shared', '2026-05-25T12-00-00-000Z', 'run_manifest.jsonl'); await ensureResultsRepoClone(config); git('git config user.email "test@example.com"', cloneDir); git('git config user.name "Test User"', cloneDir); - // Seed a shared run index on the remote and pull it into the clone. + // Seed a shared run manifest on the remote and pull it into the clone. const seedIndex = path.join(seedDir, indexRel); mkdirSync(path.dirname(seedIndex), { recursive: true }); writeFileSync(seedIndex, '{"test_id":"base"}\n'); diff --git a/skills-data/agentv-eval-writer/SKILL.md b/skills-data/agentv-eval-writer/SKILL.md index 0941e7e59..b2d0a053a 100644 --- a/skills-data/agentv-eval-writer/SKILL.md +++ b/skills-data/agentv-eval-writer/SKILL.md @@ -608,16 +608,16 @@ agentv eval assert --agent-output "..." --agent-input "..." 
 agentv import claude --session-id

 # Re-run only execution errors from a previous run
-agentv eval --retry-errors .agentv/results/default//index.jsonl
+agentv eval --retry-errors .agentv/results/default//run_manifest.jsonl

 # Validate eval file
 agentv validate

 # Compare results — N-way matrix from a canonical run manifest
-agentv compare .agentv/results/default//index.jsonl
-agentv compare .agentv/results/default//index.jsonl --baseline # CI regression gate
-agentv compare .agentv/results/default//index.jsonl --baseline --candidate # pairwise
-agentv compare .agentv/results/default//index.jsonl .agentv/results/default//index.jsonl
+agentv compare .agentv/results/default//run_manifest.jsonl
+agentv compare .agentv/results/default//run_manifest.jsonl --baseline # CI regression gate
+agentv compare .agentv/results/default//run_manifest.jsonl --baseline --candidate # pairwise
+agentv compare .agentv/results/default//run_manifest.jsonl .agentv/results/default//run_manifest.jsonl

 # Author assertions directly in the eval file
 # Prefer simple assertions when they fit the criteria; use deterministic or LLM-based graders when needed