fix(results): isolate target bundles under run timestamps#1558

Merged
christso merged 4 commits into main from result-row-id-sidecars
Jun 29, 2026
Conversation

@christso christso commented Jun 29, 2026

Collaborator

Summary

Multi-target eval runs now fan out into isolated target/variant result bundles under the invocation timestamp, so two targets with the same test_id no longer point at and overwrite the same sidecar files. Within each bundle, row sidecar directories use deterministic compact row IDs based on eval source, suite, test id, target, and variant, while index.jsonl remains the source of truth for identity and artifact paths.
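A minimal sketch of how a deterministic compact row ID could be derived from the identity fields listed above. The `RowIdentity` shape, the `compactRowId` and `slugify` helpers, and the choice of SHA-256 truncated to 12 hex characters are assumptions for illustration, not the PR's actual implementation; only the `<slug>--<12 hex>` appearance is taken from the sidecar name reported under Validation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of the identity fields named in the summary.
interface RowIdentity {
  evalSource: string;
  suite: string;
  testId: string;
  target: string;
  variant?: string;
}

// Reduce a test id to a filesystem-safe slug (assumed convention).
function slugify(value: string): string {
  const slug = value
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slug || "row";
}

// Same identity in, same ID out; changing any field changes the hash suffix.
function compactRowId(identity: RowIdentity): string {
  // JSON-encoding an array keeps field boundaries unambiguous.
  const key = JSON.stringify([
    identity.evalSource,
    identity.suite,
    identity.testId,
    identity.target,
    identity.variant ?? "",
  ]);
  const digest = createHash("sha256").update(key).digest("hex").slice(0, 12);
  return `${slugify(identity.testId)}--${digest}`;
}

// Two targets sharing a test_id get different sidecar directory names.
const base = { evalSource: "evals/demo.yaml", suite: "demo", testId: "live-proxy-case" };
const alphaRowId = compactRowId({ ...base, target: "mock-alpha" });
const betaRowId = compactRowId({ ...base, target: "mock-beta" });
```

Because the ID is only a storage name, readers would still resolve identity and artifact paths through index.jsonl rather than by parsing the directory name.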

Dashboard and result readers now discover nested bundle manifests under timestamp folders while preserving legacy root-level index.jsonl bundles. They may walk storage folders to find manifests, but target and variant semantics continue to come from loaded row/run metadata, not path parsing.
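The discovery behavior described above can be sketched as a bounded directory walk that collects index.jsonl manifests. The `findBundleManifests` name, the `maxDepth` limit, and the rule that the walk stops descending once a folder holds a manifest are assumptions for illustration; the demo layout at the end uses made-up folder names in a temporary directory.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, relative, sep } from "node:path";

// Walk storage folders to find bundle manifests. Folder names are never
// interpreted here: target and variant come from the loaded rows instead.
function findBundleManifests(root: string, maxDepth = 4): string[] {
  const found: string[] = [];
  const walk = (dir: string, depth: number): void => {
    const manifest = join(dir, "index.jsonl");
    if (existsSync(manifest)) {
      found.push(manifest);
      return; // assumed: row sidecar folders inside a bundle are not searched
    }
    if (depth >= maxDepth) return;
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      if (entry.isDirectory()) walk(join(dir, entry.name), depth + 1);
    }
  };
  walk(root, 0);
  return found.sort();
}

// Demo: one legacy bundle with a root-level manifest, one nested run with two targets.
const demoRoot = mkdtempSync(join(tmpdir(), "bundle-discovery-"));
for (const dir of ["legacy-run", "2026-06-29T02-51-25-032Z/mock-alpha", "2026-06-29T02-51-25-032Z/mock-beta"]) {
  mkdirSync(join(demoRoot, dir), { recursive: true });
  writeFileSync(join(demoRoot, dir, "index.jsonl"), "");
}
const discovered = findBundleManifests(demoRoot).map((p) =>
  relative(demoRoot, p).split(sep).join("/"),
);
```

In this layout the legacy bundle and both nested target bundles are returned by the same call, so a reader would not need separate code paths for the old and new folder shapes.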

Fixes #1557.

Validation

  • Rebased onto current origin/main; branch freshness check git log HEAD..origin/main is empty.
  • bun install after rebase to materialize declared dependencies.
  • bun run build
  • bun run typecheck
  • Focused tests for eval artifact writing, aggregation, nested Dashboard discovery/drilldown, export, rerun, integration output paths, and core evaluation APIs: 394 passed, 0 failed.
  • Earlier full local bun run test: core 2063 passed, CLI 723 passed, SDK 88 passed, Dashboard 142 passed.
  • Local two-target mock eval dogfood: produced distinct mock-alpha/index.jsonl and mock-beta/index.jsonl bundles with separate row IDs and answer files.
  • Same-test_id/duplicate-suite dogfood: produced two distinct row sidecar directories inside one target bundle and preserved both summary rows.
  • Dashboard browser/API UAT against a canonical nested-bundle project: list, detail, file drilldown, and compare all loaded both targets from metadata.
  • Remote sync dogfood with a file-backed agentv/results/v1 branch: pushed nested bundles, cleared local materialized results/cache, synced through Dashboard project APIs, and loaded remote list/detail/compare/drilldown successfully.
  • Live provider/grader dogfood through the local OpenAI-compatible OAuth proxy: codex target with api_format: responses plus openai grader target with api_format: chat, all credentials/model routed through LOCAL_OPENAI_PROXY_* env refs. Result: PASS, 1/1, score 1.0. Run bundle: /tmp/agentv-av9vi-live-utlpOx/.agentv/results/av9vi-live-dogfood/2026-06-29T02-51-25-032Z/codex-local-proxy/index.jsonl; row sidecar: live-proxy-case--3d1e8b8bde59/run-1/; grader type: llm-grader; answer: agentv live dogfood ok.
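For orientation, the run bundle path reported in the last item follows a run name / timestamp / target layout. The sketch below rebuilds that shape; `runTimestampFolder`, `bundleIndexPath`, and the `runName` parameter are hypothetical names, and deriving the folder from an ISO timestamp by replacing `:` and `.` with `-` is an assumption inferred from the path shown, not confirmed by the PR.

```typescript
import { posix } from "node:path";

// Turn a Date into a folder-safe timestamp such as 2026-06-29T02-51-25-032Z.
function runTimestampFolder(date: Date): string {
  return date.toISOString().replace(/[:.]/g, "-");
}

// Assumed layout: <resultsRoot>/<runName>/<timestamp>/<target>/index.jsonl
function bundleIndexPath(
  resultsRoot: string,
  runName: string,
  startedAt: Date,
  target: string,
): string {
  return posix.join(resultsRoot, runName, runTimestampFolder(startedAt), target, "index.jsonl");
}

const startedAt = new Date("2026-06-29T02:51:25.032Z");
const alphaIndex = bundleIndexPath(".agentv/results", "demo-run", startedAt, "mock-alpha");
const betaIndex = bundleIndexPath(".agentv/results", "demo-run", startedAt, "mock-beta");
```

Both targets share the invocation timestamp folder but write to separate index.jsonl files, which is the isolation the two-target dogfood item checks.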

Compound Engineering
GPT-5

Private evidence: https://github.com/EntityProcess/agentv-private/tree/evidence/av-9vi-result-row-sidecars (commit 0b585d2) preserves the live local-proxy Codex target + OpenAI grader run bundle on an orphan branch.

cloudflare-workers-and-pages Bot commented Jun 29, 2026


Deploying agentv with Cloudflare Pages

Latest commit: f656fd4
Status: ✅  Deploy successful!
Preview URL: https://e72a68d4.agentv.pages.dev
Branch Preview URL: https://result-row-id-sidecars.agentv.pages.dev

@christso christso force-pushed the result-row-id-sidecars branch from 45ad352 to 458d292 Compare June 29, 2026 02:31
christso added 2 commits June 29, 2026 06:36
Merge PR #1560 for Bead av-k0e after independent read-only code review reported no actionable issues and verification passed.
@christso christso merged commit cd02dee into main Jun 29, 2026
8 checks passed
@christso christso deleted the result-row-id-sidecars branch June 29, 2026 05:38
Development

Successfully merging this pull request may close these issues.

Fix result sidecar identity for multi-target runs
