feat(examples): add deterministic oracle sweep by christso · Pull Request #1559 · EntityProcess/agentv

christso · 2026-06-29T04:07:16Z

Summary

Adds a deterministic oracle sweep for agent/LLM-backed example evals so maintainers can prove those examples still load, execute, and write artifacts without provider credentials or live LLM calls.

This is a contract oracle for example execution, not a set of captured live-model golden transcripts. Examples that already use oracle-capable targets are classified as covered and are not replay-overridden.

Oracle Definition and Coverage

In this PR, an example eval's oracle is defined by its target kind:

Oracle-capable targets are their own oracle. These include mock, cli, replay, transcript, and passive copilot-log providers. The sweep inventories them as oracle_target and does not replace their target output.
Agent or LLM-backed targets require an oracle fixture. These include concrete model providers, interactive agent providers, and unresolved env-delegated targets such as root llm -> ${{ LLM_TARGET }} or default -> ${{ AGENT_TARGET }}. The sweep generates replay fixture rows for these evals and runs them with --target example_oracle plus --grader-target example_oracle_grader.
Generated replay outputs come from existing reference_answer / expected_output when available, otherwise from deterministic assertion-derived content. The goal is contract coverage for parsing, execution, grading plumbing, and artifact writing without provider keys, not semantic live-model quality scoring.
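
The classification rule above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: `TargetRef`, `classifyTarget`, `isEnvDelegated`, and the exact provider strings (for example `copilot-log`) are assumptions based only on the description.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not taken from the PR's code.
type Classification = "oracle_target" | "requires_oracle_fixture";

// Provider kinds the description lists as oracle-capable (string spellings assumed).
const ORACLE_CAPABLE = new Set(["mock", "cli", "replay", "transcript", "copilot-log"]);

interface TargetRef {
  name: string;
  // Provider kind; undefined when the target only delegates elsewhere.
  provider?: string;
  // For env-delegated targets, e.g. "${{ LLM_TARGET }}".
  useTarget?: string;
}

// True when the target points at an unresolved ${{ VAR }} placeholder.
function isEnvDelegated(target: TargetRef): boolean {
  return target.useTarget !== undefined && /^\$\{\{\s*\w+\s*\}\}$/.test(target.useTarget);
}

function classifyTarget(target: TargetRef): Classification {
  // Unresolved env delegation could resolve to a live model, so it needs a fixture.
  if (isEnvDelegated(target)) return "requires_oracle_fixture";
  // Oracle-capable providers are their own oracle and are left untouched.
  if (target.provider !== undefined && ORACLE_CAPABLE.has(target.provider)) {
    return "oracle_target";
  }
  // Everything else (concrete model or interactive agent providers) needs a fixture.
  return "requires_oracle_fixture";
}
```

Treating env-delegated targets as fixture-required is the conservative choice in this sketch: the sweep cannot know offline what `${{ LLM_TARGET }}` would resolve to.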

Current coverage from bun run examples:oracle -- --inventory:

113 eval files discovered under examples/
85 requires_oracle_fixture: run through generated replay fixtures
24 oracle_target: already covered by oracle-capable local/static/replay targets
0 needs_fixture_added
4 excluded with explicit reasons below
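
As a quick sanity check, the four buckets should account for every discovered eval file. A minimal sketch, with the counts copied from the inventory above and the helper name `totalClassified` being purely illustrative:

```typescript
// Counts copied from the inventory output quoted above.
const inventory: Record<string, number> = {
  requires_oracle_fixture: 85,
  oracle_target: 24,
  needs_fixture_added: 0,
  excluded: 4,
};
const discovered = 113;

// Sum every bucket so a missing or double-counted eval shows up as a nonzero gap.
function totalClassified(counts: Record<string, number>): number {
  return Object.values(counts).reduce((sum, n) => sum + n, 0);
}

const unclassified = discovered - totalClassified(inventory);
console.log(unclassified); // 0: 85 + 24 + 0 + 4 = 113
```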

What changed

Adds bun run examples:oracle as the standard workflow for running non-oracle example evals against a generated replay target and oracle CLI grader target.
Adds bun run examples:oracle:gate as the modular gate wrapper: it runs validate:examples, then forwards arguments to examples:oracle.
Adds examples/oracle-fixtures.yaml as the maintained oracle inventory control file with explicit exclusions.
Classifies example eval targets as requires_oracle_fixture, oracle_target, or excluded.
Generates replay fixtures from reference answers, expected assistant outputs, or assertion-derived deterministic content at runtime under .agentv/tmp/ only for evals that otherwise require an agent or LLM target.
Adds a publish-next GitHub Actions gate that runs bun run build once, then bun run examples:oracle:gate, before npm publish. The next publish path then calls bun scripts/publish.ts next directly to avoid a second build.
Removes stale examples/features/workspace-setup-script/evals/dataset-vscode.eval.yaml because VSCode target support was removed.
Updates the cross-repo TypeScript runner to prefer Bun for .ts scripts, avoiding an unnecessary tsx dependency during offline oracle runs.
Documents the oracle sweep in examples/README.md.
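
The fixture-generation fallback described in this list might look like the sketch below. It is an assumption-heavy illustration: the `ExampleTest` and `ReplayRow` shapes, the `contains` assertion kind, treating `expected_output` as a plain string, and the precedence of `reference_answer` over `expected_output` are all hypothetical and not confirmed by the PR.

```typescript
// Hypothetical, simplified test shape; the real eval schema may differ.
interface ExampleTest {
  id: string;
  reference_answer?: string;
  expected_output?: string;
  // Only an assumed "contains"-style assertion is modelled here.
  assertions?: { type: string; value?: string }[];
}

interface ReplayRow {
  id: string;
  output: string;
  source: "reference_answer" | "expected_output" | "assertions";
}

function deriveReplayOutput(test: ExampleTest): ReplayRow {
  // Prefer authored answers when present (ordering between the two is assumed).
  if (test.reference_answer) {
    return { id: test.id, output: test.reference_answer, source: "reference_answer" };
  }
  if (test.expected_output) {
    return { id: test.id, output: test.expected_output, source: "expected_output" };
  }
  // Otherwise build deterministic content from the assertion values, in declaration order.
  const parts = (test.assertions ?? [])
    .filter((a) => a.type === "contains" && a.value !== undefined)
    .map((a) => a.value as string);
  return { id: test.id, output: parts.join("\n"), source: "assertions" };
}
```

Because the output depends only on the eval file's own contents, repeated runs produce the same row, which is what makes the sweep deterministic without provider keys.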

Exclusions

examples/features/docker-workspace/evals/docker-example.EVAL.yaml: requires the Docker workspace runtime; replaying target output does not remove the Docker setup dependency.
examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml: requires local Copilot session/workspace artifacts and .github/skills/agentv-bench/SKILL.md during before_all setup.
examples/features/prompt-template-sdk/evals/dataset.eval.yaml: the loader returns zero runnable tests because the eval references a missing command file, ../prompts/custom-grader.ts.
examples/showcase/bug-fix-benchmark/evals/bug-fixes.eval.yaml: invalid eval metadata today; experiment targets[0].use_target is empty.

Verification

bun install
bun run build
bun run examples:oracle -- --eval examples/features/assert/evals/dataset.eval.yaml --eval examples/features/default-graders/evals/dataset.eval.yaml --eval examples/showcase/grader-conformance/EVAL.yaml --output-dir .agentv/tmp/example-oracle-smoke
`bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-full-2` passed before target classification narrowing: 110 runnable, 0 failures, 4 excluded
`bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-agent-targets` passed after target classification narrowing: 86 oracle-required, 0 failures, 24 oracle-capable targets, 4 excluded
`bun run build && bun run validate:examples && bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-publish-next-check` passed after removing the stale VSCode eval: 85 oracle-required, 0 failures, 24 oracle targets, 4 excluded
bun run examples:oracle:gate -- --inventory
bun run validate:examples
bun run examples:oracle -- --inventory
bun run lint
bun run typecheck
actionlint not installed locally; workflow syntax was reviewed by inspection
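
The gate invocation in this list runs validation first and then forwards everything after `--` to the sweep. A minimal sketch of that ordering, assuming a hypothetical `planGate` helper (the PR's actual wrapper script is not shown here and may be structured differently):

```typescript
// Hypothetical sketch of the gate's command ordering; not the PR's actual script.
function planGate(forwardedArgs: string[]): string[][] {
  return [
    // Step 1: validate example files before running anything.
    ["bun", "run", "validate:examples"],
    // Step 2: run the oracle sweep with the caller's arguments passed through unchanged.
    ["bun", "run", "examples:oracle", "--", ...forwardedArgs],
  ];
}

// Render an argv array as a shell-style line for logging.
function formatCommand(argv: string[]): string {
  return argv.join(" ");
}

const plan = planGate(["--inventory"]).map(formatCommand);
console.log(plan);
// → ["bun run validate:examples", "bun run examples:oracle -- --inventory"]
```

Keeping the plan as data makes the ordering easy to assert on; a real wrapper would execute each step in turn and stop at the first non-zero exit code.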

cloudflare-workers-and-pages · 2026-06-29T04:08:02Z

Deploying agentv with Cloudflare Pages

Latest commit:	`30f8d92`
Status:	✅ Deploy successful!
Preview URL:	https://d70bbf7f.agentv.pages.dev
Branch Preview URL:	https://feat-example-oracle-fixtures.agentv.pages.dev

christso · 2026-06-29T09:18:05Z

Superseded by Bead av-rjz6. Scope is reduced to oracle/replay only for expensive agent targets while keeping declared graders unchanged; this draft used mocked grader plumbing and should not be merged.

christso force-pushed the feat/example-oracle-fixtures branch from b656e15 to 6c94d36 · June 29, 2026 05:45

christso added 4 commits June 29, 2026 08:34

feat(examples): add deterministic oracle sweep

5aaa80a

docs(examples): clarify oracle sweep semantics

cd757f0

fix(examples): limit oracle sweep to non-deterministic targets

04b30f0

fix(examples): align oracle naming and publish gate

06181bb

christso force-pushed the feat/example-oracle-fixtures branch from 6c94d36 to 06181bb · June 29, 2026 06:34

christso added 2 commits June 29, 2026 08:54

ci(publish): validate examples before oracle gate

1b67232

ci(publish): wrap example oracle gate

30f8d92

christso closed this Jun 29, 2026

christso wants to merge 6 commits into main from feat/example-oracle-fixtures

christso commented Jun 29, 2026 •

edited

Loading

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 29, 2026 •

edited

Loading

Uh oh!

christso commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

christso commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Oracle Definition and Coverage

What changed

Exclusions

Verification

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying agentv with Cloudflare Pages

Uh oh!

christso commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

christso commented Jun 29, 2026 •

edited

Loading

cloudflare-workers-and-pages Bot commented Jun 29, 2026 •

edited

Loading