feat(examples): add deterministic oracle sweep#1559
Closed
christso wants to merge 6 commits into
Deploying agentv with
| Latest commit: | 30f8d92 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://d70bbf7f.agentv.pages.dev |
| Branch Preview URL: | https://feat-example-oracle-fixtures.agentv.pages.dev |
b656e15 to 6c94d36
6c94d36 to 06181bb
Superseded by Bead av-rjz6. Scope is reduced to oracle/replay only for expensive agent targets while keeping declared graders unchanged; this draft used mocked grader plumbing and should not be merged.
Summary
Adds a deterministic oracle sweep for agent/LLM-backed example evals so maintainers can prove those examples still load, execute, and write artifacts without provider credentials or live LLM calls.
This is a contract oracle for example execution, not a set of captured live-model golden transcripts. Examples that already use oracle-capable targets are classified as covered and are not replay-overridden.
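The classification described above can be pictured with a minimal sketch. Only the three status strings and the list of oracle-capable target kinds come from this PR description; the function, the `EvalEntry` shape, and the idea that exclusions are looked up by path before the target kind is checked are assumptions made for illustration, not the sweep's actual code.

```typescript
// Hypothetical sketch; names and data shapes are not taken from the PR's scripts.
type OracleStatus = "requires_oracle_fixture" | "oracle_target" | "excluded";

// Target kinds the PR description treats as already oracle-capable.
const ORACLE_CAPABLE_KINDS: ReadonlySet<string> = new Set([
  "mock",
  "cli",
  "replay",
  "transcript",
  "copilot-log",
]);

interface EvalEntry {
  path: string; // eval file path relative to the repo root
  targetKind: string; // resolved kind of the eval's target (assumed representation)
}

// Assumption: explicit exclusions (path -> reason) win over target-kind checks.
function classifyEval(
  entry: EvalEntry,
  exclusions: ReadonlyMap<string, string>,
): OracleStatus {
  if (exclusions.has(entry.path)) return "excluded";
  if (ORACLE_CAPABLE_KINDS.has(entry.targetKind)) return "oracle_target";
  // Anything else is assumed to need a live agent or LLM, so it gets a replay fixture.
  return "requires_oracle_fixture";
}
```

Checking exclusions first matches the case where a `copilot-log` eval is excluded even though its target kind is otherwise oracle-capable.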
Oracle Definition and Coverage
In this PR, an example eval's oracle is defined by its target kind:
- Oracle-capable targets: `mock`, `cli`, `replay`, `transcript`, and passive `copilot-log` providers. The sweep inventories them as `oracle_target` and does not replace their target output.
- Agent/LLM-backed targets: `llm -> ${{ LLM_TARGET }}` or `default -> ${{ AGENT_TARGET }}`. The sweep generates replay fixture rows for these evals and runs them with `--target example_oracle` plus `--grader-target example_oracle_grader`.
- Fixture content: taken from `reference_answer`/`expected_output` when available, otherwise from deterministic assertion-derived content. The goal is contract coverage for parsing, execution, grading plumbing, and artifact writing without provider keys, not semantic live-model quality scoring.
Current coverage from `bun run examples:oracle -- --inventory`, over `examples/`:
- `requires_oracle_fixture`: run through generated replay fixtures
- `oracle_target`: already covered by oracle-capable local/static/replay targets
- `needs_fixture_added`
- `excluded`, with explicit reasons below
What changed
- Added `bun run examples:oracle` as the standard workflow for running non-oracle example evals against a generated replay target and oracle CLI grader target.
- Added `bun run examples:oracle:gate` as the modular gate wrapper: it runs `validate:examples`, then forwards arguments to `examples:oracle`.
- Added `examples/oracle-fixtures.yaml` as the maintained oracle inventory control file with explicit exclusions.
- `requires_oracle_fixture`, `oracle_target`, or `excluded`.
- `.agentv/tmp/` only for evals that otherwise require an agent or LLM target.
- `bun run build` once, then `bun run examples:oracle:gate`, before npm publish. The next publish path then calls `bun scripts/publish.ts next` directly to avoid a second build.
- Removed `examples/features/workspace-setup-script/evals/dataset-vscode.eval.yaml` because VSCode target support was removed.
- `.ts` scripts, avoiding an unnecessary `tsx` dependency during offline oracle runs.
- `examples/README.md`.
Exclusions
- `examples/features/docker-workspace/evals/docker-example.EVAL.yaml`: requires Docker workspace runtime; replaying target output does not remove the Docker setup dependency.
- `examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml`: requires local Copilot session/workspace artifacts and `.github/skills/agentv-bench/SKILL.md` during `before_all` setup.
- `examples/features/prompt-template-sdk/evals/dataset.eval.yaml`: loader returns zero runnable tests because the eval references the missing `../prompts/custom-grader.ts` command file.
- `examples/showcase/bug-fix-benchmark/evals/bug-fixes.eval.yaml`: invalid eval metadata today; `experiment targets[0].use_target` is empty.
Verification
- `bun install`
- `bun run build`
- `bun run examples:oracle -- --eval examples/features/assert/evals/dataset.eval.yaml --eval examples/features/default-graders/evals/dataset.eval.yaml --eval examples/showcase/grader-conformance/EVAL.yaml --output-dir .agentv/tmp/example-oracle-smoke`
- `bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-full-2` passed before target classification narrowing: 110 runnable, 0 failures, 4 excluded
- `bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-agent-targets` passed after target classification narrowing: 86 oracle-required, 0 failures, 24 oracle-capable targets, 4 excluded
- `bun run build && bun run validate:examples && bun run examples:oracle -- --output-dir .agentv/tmp/example-oracle-publish-next-check` passed after removing stale VSCode eval: 85 oracle-required, 0 failures, 24 oracle targets, 4 excluded
- `bun run examples:oracle:gate -- --inventory`
- `bun run validate:examples`
- `bun run examples:oracle -- --inventory`
- `bun run lint`
- `bun run typecheck`
- `actionlint` not installed locally; workflow syntax was reviewed by inspection
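For readers unfamiliar with the gate wrapper, the following is a rough sketch of the command sequence it is described as running: `validate:examples` first, then `examples:oracle` with the caller's arguments forwarded. The helper name `gateCommands` and the decision to return commands as data instead of executing them are assumptions for illustration; the script names and the `--` argument separator are taken from the commands listed in this PR.

```typescript
// Hypothetical helper: lists the commands a gate wrapper would run, in order.
// It does not execute anything, so it can be inspected without bun installed.
function gateCommands(forwardedArgs: readonly string[]): string[][] {
  const validate = ["bun", "run", "validate:examples"];
  const oracle = ["bun", "run", "examples:oracle"];
  // `--` separates bun's own flags from the script's arguments,
  // mirroring `bun run examples:oracle -- --inventory` in the PR.
  if (forwardedArgs.length > 0) {
    oracle.push("--", ...forwardedArgs);
  }
  return [validate, oracle];
}

// Example: the commands implied by `bun run examples:oracle:gate -- --inventory`.
for (const cmd of gateCommands(["--inventory"])) {
  console.log(cmd.join(" "));
}
```

A real wrapper would presumably stop at the first command that exits non-zero, so an invalid example fails the gate before any oracle run starts; that behavior is assumed here, not confirmed by the PR text.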