feat(examples): add deterministic oracle sweep #1559

Closed
christso wants to merge 6 commits into main from feat/example-oracle-fixtures

@christso christso commented Jun 29, 2026

Collaborator

Summary

Adds a deterministic oracle sweep for agent/LLM-backed example evals so maintainers can prove those examples still load, execute, and write artifacts without provider credentials or live LLM calls.

This is a contract oracle for example execution, not a set of captured live-model golden transcripts. Examples that already use oracle-capable targets are classified as covered and are not replay-overridden.

Oracle Definition and Coverage

In this PR, an example eval's oracle is defined by its target kind:

  • Oracle-capable targets are their own oracle. These include mock, cli, replay, transcript, and passive copilot-log providers. The sweep inventories them as oracle_target and does not replace their target output.
  • Agent or LLM-backed targets require an oracle fixture. These include concrete model providers, interactive agent providers, and unresolved env-delegated targets such as root llm -> ${{ LLM_TARGET }} or default -> ${{ AGENT_TARGET }}. The sweep generates replay fixture rows for these evals and runs them with --target example_oracle plus --grader-target example_oracle_grader.
  • Generated replay outputs come from existing reference_answer / expected_output when available, otherwise from deterministic assertion-derived content. The goal is contract coverage for parsing, execution, grading plumbing, and artifact writing without provider keys, not semantic live-model quality scoring.
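
The classification rule above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the type names, the `classify` function, the `copilot-log` identifier spelling, and the choice to let explicit exclusions take precedence are all assumptions.

```typescript
// Hypothetical names throughout: the PR text does not show the real types.
type OracleClass = "oracle_target" | "requires_oracle_fixture" | "excluded";

// Provider kinds the description lists as oracle-capable.
// The exact identifier strings are an assumption.
const ORACLE_CAPABLE = new Set(["mock", "cli", "replay", "transcript", "copilot-log"]);

interface ExampleEval {
  path: string;
  // May be an unresolved delegation such as "${{ LLM_TARGET }}".
  provider: string;
}

function classify(evalFile: ExampleEval, exclusions: Map<string, string>): OracleClass {
  // Assumption: an explicit exclusion wins over the provider kind.
  if (exclusions.has(evalFile.path)) return "excluded";
  // Oracle-capable targets are their own oracle and are left untouched.
  if (ORACLE_CAPABLE.has(evalFile.provider)) return "oracle_target";
  // Everything else (model providers, agent providers, unresolved
  // env-delegated targets) needs a generated replay fixture.
  return "requires_oracle_fixture";
}
```

Under these assumptions, an eval whose provider is `${{ AGENT_TARGET }}` falls through to `requires_oracle_fixture`, while a `mock` eval is inventoried as `oracle_target`.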

Current coverage from bun run examples:oracle -- --inventory:

  • 113 eval files discovered under examples/
  • 85 requires_oracle_fixture: run through generated replay fixtures
  • 24 oracle_target: already covered by oracle-capable local/static/replay targets
  • 0 needs_fixture_added
  • 4 excluded with explicit reasons below
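
As a quick consistency check, the four buckets should account for every discovered eval file. The helper below is a hypothetical sketch; only the counts come from the inventory output above.

```typescript
// Hypothetical helper: sums the per-class counts reported by the inventory.
function inventoryTotal(counts: Record<string, number>): number {
  return Object.values(counts).reduce((sum, n) => sum + n, 0);
}

// Counts copied from the inventory listing above.
const inventory = {
  requires_oracle_fixture: 85,
  oracle_target: 24,
  needs_fixture_added: 0,
  excluded: 4,
};

const discovered = 113;
if (inventoryTotal(inventory) !== discovered) {
  throw new Error("inventory classes do not cover all discovered eval files");
}
```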

What changed

  • Adds bun run examples:oracle as the standard workflow for running non-oracle example evals against a generated replay target and oracle CLI grader target.
  • Adds bun run examples:oracle:gate as the modular gate wrapper: it runs validate:examples, then forwards arguments to examples:oracle.
  • Adds examples/oracle-fixtures.yaml as the maintained oracle inventory control file with explicit exclusions.
  • Classifies example eval targets as requires_oracle_fixture, oracle_target, or excluded.
  • Generates replay fixtures from reference answers, expected assistant outputs, or assertion-derived deterministic content at runtime under .agentv/tmp/ only for evals that otherwise require an agent or LLM target.
  • Adds a publish-next GitHub Actions gate that runs bun run build once, then bun run examples:oracle:gate, before npm publish. The next publish path then calls bun scripts/publish.ts next directly to avoid a second build.
  • Removes stale examples/features/workspace-setup-script/evals/dataset-vscode.eval.yaml because VSCode target support was removed.
  • Updates the cross-repo TypeScript runner to prefer Bun for .ts scripts, avoiding an unnecessary tsx dependency during offline oracle runs.
  • Documents the oracle sweep in examples/README.md.
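
The replay fixture generation described above picks an output per test from the first available source. A minimal sketch, assuming a hypothetical `TestCase` shape and a hypothetical `replayOutput` helper; the relative order of `reference_answer` and `expected_output`, and the way assertion values are joined, are assumptions rather than the PR's confirmed behavior:

```typescript
// Hypothetical shape: field names reference_answer / expected_output come from
// the PR text; the assertion structure is an assumption for illustration.
interface TestCase {
  reference_answer?: string;
  expected_output?: string;
  assertions?: { type: string; value?: string }[];
}

function replayOutput(test: TestCase): string {
  // Prefer author-provided answers when they exist.
  if (test.reference_answer) return test.reference_answer;
  if (test.expected_output) return test.expected_output;
  // Fallback: deterministic content derived from assertion values, in
  // declaration order, so repeated runs produce identical fixtures.
  return (test.assertions ?? [])
    .map((a) => a.value)
    .filter((v): v is string => typeof v === "string" && v.length > 0)
    .join("\n");
}
```

Because the fallback depends only on the eval file's own content, the generated fixture is stable across runs and needs no provider keys.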

Exclusions

  • examples/features/docker-workspace/evals/docker-example.EVAL.yaml: requires Docker workspace runtime; replaying target output does not remove Docker setup dependency.
  • examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml: requires local Copilot session/workspace artifacts and .github/skills/agentv-bench/SKILL.md during before_all setup.
  • examples/features/prompt-template-sdk/evals/dataset.eval.yaml: loader returns zero runnable tests because the eval references missing ../prompts/custom-grader.ts command file.
  • examples/showcase/bug-fix-benchmark/evals/bug-fixes.eval.yaml: invalid eval metadata today; experiment targets[0].use_target is empty.

Verification

  • bun install
  • bun run build
  • bun run examples:oracle -- --eval examples/features/assert/evals/dataset.eval.yaml --eval examples/features/default-graders/evals/dataset.eval.yaml --eval examples/showcase/grader-conformance/EVAL.yaml --output-dir .agentv/tmp/example-oracle-smoke
  • bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-full-2, which passed before target classification narrowing: 110 runnable, 0 failures, 4 excluded
  • bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-agent-targets, which passed after target classification narrowing: 86 oracle-required, 0 failures, 24 oracle-capable targets, 4 excluded
  • bun run build && bun run validate:examples && bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-publish-next-check, which passed after removing the stale VSCode eval: 85 oracle-required, 0 failures, 24 oracle-capable targets, 4 excluded
  • bun run examples:oracle:gate -- --inventory
  • bun run validate:examples
  • bun run examples:oracle -- --inventory
  • bun run lint
  • bun run typecheck
  • actionlint not installed locally; workflow syntax was reviewed by inspection

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 29, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 30f8d92
Status: ✅  Deploy successful!
Preview URL: https://d70bbf7f.agentv.pages.dev
Branch Preview URL: https://feat-example-oracle-fixtures.agentv.pages.dev

@christso christso force-pushed the feat/example-oracle-fixtures branch from b656e15 to 6c94d36 Compare June 29, 2026 05:45
@christso christso force-pushed the feat/example-oracle-fixtures branch from 6c94d36 to 06181bb Compare June 29, 2026 06:34
@christso

Collaborator Author

Superseded by Bead av-rjz6. Scope is reduced to oracle/replay only for expensive agent targets while keeping declared graders unchanged; this draft used mocked grader plumbing and should not be merged.

@christso christso closed this Jun 29, 2026