Info-seeking active experiments, recurrent LM fit, and subgoal replanning#38

Merged
yichao-liang merged 201 commits into master from sim-learning on Jun 12, 2026

@yichao-liang

Collaborator

Summary

Builds on the merged PO-boil work (#37) with active-experiment exploration, recurrent latent fitting, and closed-loop replanning for the sim-learning approach. The latest master is merged in; conflicts were resolved in favor of the newer branch APIs while preserving #37's non-conflicting changes.

Info-seeking / active experiments

  • Atom-disagreement scoring + ensembles (code_sim_learning/active_experiment.py): perturbation, posterior-subsample, and Laplace ensembles, plus mean-Bernoulli-entropy scoring over a parameter ensemble.
  • Pooled feasible-candidate proposal in bilevel_sketch.refine_sketch: when an info_scorer is supplied, each subgoal-annotated step draws candidates until info_n_feasible_target feasible ones are pooled (bounded by the per-node rollout budget), proposes the most informative one, and banks the rest as a ranked retry stock. The rollout budget is not multiplied.
  • Wired through the explorer and learning approach behind agent_explorer_info_seeking, off by default (path unchanged when no scorer is supplied).
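
The mean-Bernoulli-entropy score described above can be sketched as follows. This is a minimal illustration, not the code in active_experiment.py: `predict_atoms` is a hypothetical callback mapping one parameter draw to predicted atom truth values, and measuring entropy in nats is an assumption.

```python
import math
from typing import Callable, Dict, Sequence

Params = Dict[str, float]


def bernoulli_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) variable; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean_atom_disagreement(
    ensemble: Sequence[Params],
    atoms: Sequence[str],
    predict_atoms: Callable[[Params], Dict[str, bool]],
) -> float:
    """Average, over atoms, of the entropy of the ensemble's vote on each atom."""
    if not ensemble or not atoms:
        return 0.0
    votes = [predict_atoms(theta) for theta in ensemble]
    total = 0.0
    for atom in atoms:
        p_true = sum(v[atom] for v in votes) / len(votes)
        total += bernoulli_entropy(p_true)
    return total / len(atoms)


# Toy ensemble: members disagree on JugFilled (2 of 4 say True), agree on BurnerOn.
ensemble = [{"rate": r} for r in (0.5, 1.0, 1.5, 2.0)]


def predict(theta: Params) -> Dict[str, bool]:
    return {"JugFilled": theta["rate"] >= 1.25, "BurnerOn": True}


score = mean_atom_disagreement(ensemble, ["JugFilled", "BurnerOn"], predict)
```

A candidate state on which every ensemble member agrees scores zero, so ranking by this score favors experiments whose outcome the current fit cannot yet predict.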

Recurrent fitting

  • Recurrent LM warm-start + bound-aware MAP fit and a FitResult carrying a Laplace bundle from parameter fitting; threaded through the synthesis tools so the agent scores latent rules through the same path.

Closed-loop test execution

  • subgoal_annotations execution monitor: replans when a finished step's subgoal annotation fails in the real state, capped by agent_bilevel_max_execution_replans.

Misc

  • Boil faucet outlet via a general rotation-matrix form; multi-object / per-object-latent synthesis prompting; assorted config and prompt updates.

Testing

All four CI checks were run locally against the CI-pinned tool versions:

  • mypy . --config-file mypy.ini (1.8.0, incl. --platform linux) — clean
  • pytest . --pylint (pylint 2.14.5) — 601 passed, fresh/uncached
  • yapf 0.32.0 / isort 5.10.1 / docformatter 1.4 — clean
  • pytest tests/ — 874 passed; the only failure is test_push_second_switch_boil_position_mode, a known PyBullet macOS↔Linux divergence that fails locally but passes on CI Linux (confirmed it fails identically on origin/master, which is green on CI).

Delegate option execution to option_model.get_next_state_and_num_actions
instead of duplicating its termination logic (stuck detection, Wait
atom-change checks) and directly accessing its simulator.
…inement

Extract the duplicated backtracking loop from run_low_level_search (SeSamE)
and _refine_sketch (agent bilevel) into a single run_backtracking_refinement
function in planning.py. Both callers now delegate to it with their own
sample_fn and validate_fn callbacks, eliminating ~80 lines of duplicated
loop/backtracking logic.
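
A rough sketch of the shared loop, assuming a simplified callback contract (the real run_backtracking_refinement signature in planning.py is not shown in this PR, so `sample_fn`, `validate_fn`, and their argument lists here are illustrative):

```python
from typing import Callable, List, Optional, Sequence, TypeVar

S = TypeVar("S")  # search-node state
C = TypeVar("C")  # candidate for one step (e.g. a grounded option)


def backtracking_refine(
    init_state: S,
    max_tries: Sequence[int],
    sample_fn: Callable[[int, S], C],
    validate_fn: Callable[[int, S, C], Optional[S]],
) -> Optional[List[C]]:
    """Pick one candidate per step; back up a step when its retries run out."""
    n = len(max_tries)
    states: List[S] = [init_state]  # states[i] is the state before step i
    plan: List[C] = []
    tries = [0] * n
    step = 0
    while step < n:
        if tries[step] >= max_tries[step]:
            # This step is exhausted: reset its counter and revisit the previous one.
            tries[step] = 0
            if step == 0:
                return None
            step -= 1
            plan.pop()
            states.pop()
            continue
        tries[step] += 1
        cand = sample_fn(step, states[-1])
        next_state = validate_fn(step, states[-1], cand)
        if next_state is not None:
            plan.append(cand)
            states.append(next_state)
            step += 1
    return plan


# Toy problem: step 0 proposes 1, 2, 3 in turn; step 1 always proposes 3 and
# is only valid when the running sum reaches 5.
first_step_candidates = iter([1, 2, 3])


def sample(step: int, state: int) -> int:
    return next(first_step_candidates) if step == 0 else 3


def validate(step: int, state: int, cand: int) -> Optional[int]:
    total = state + cand
    if step == 1 and total != 5:
        return None
    return total


plan = backtracking_refine(0, [3, 2], sample, validate)
```

The two callers would differ only in the callbacks they pass: one samples continuous parameters for a skeleton, the other samples for sketch steps and checks subgoal annotations.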
Replace 60 lines of manual option-model execution with a call to
run_backtracking_refinement using max_tries=[1] and a sample_fn that
returns the pre-grounded options. Remove unused Any import.
Move the _current_observation assignment into _reset_state so callers
don't need to remember the two-step pattern. Clarify the relationship
between _current_observation (backing field) and _current_state (typed
read accessor) in docstrings and comments.
Adds agent_bilevel_plan_sketch_file setting that, when set to a file
path, loads the plan sketch directly from that file, bypassing the
foundation model query. Includes test data files and a unit test.
Extract repeated wait-termination check into _check_wait_termination helper
and unify the three _terminal branches into a single definition with
config checks inside the function body.
- Remove dead/commented-out code and stale self-question comments
- Add _VIRTUAL_OBJECT_TYPES constant to replace hardcoded type-name
  skip lists in _set_state and _get_state
- Move env-specific _get_robot_state_dict branches to subclass overrides
  in pybullet_cover and pybullet_blocks
- Extract _get_camera_matrices helper to deduplicate render methods
- Extract _get_object_state_dict from _get_state for per-object logic
- Move create_pybullet_block/sphere to pybullet_helpers/objects.py
- Merge _create_task_specific_objects into _set_domain_specific_state
- Rename: _reset_state -> _set_state,
  _reset_custom_env_state -> _set_domain_specific_state,
  _extract_feature -> _get_domain_specific_feature
- Add docstrings explaining where each method is called from
Reorganize methods into labeled sections (Setup, Public API, Core Loop,
State Write/Read, Grasp Management, Action Helpers, Rendering, Utilities)
so related functions are adjacent. Update module docstring to document
the main public API and state synchronization methods.
Add _step_base() and _domain_specific_step() to PyBulletEnv base class.
step() now calls _step_base (robot control, physics, grasp) then
_domain_specific_step (water filling, heating, etc.), gated by
_skip_domain_specific_dynamics flag for kinematics-only mode.

Migrate all 15 domain envs to override _domain_specific_step() instead
of step(). Envs with pre-step logic (coffee, switch, blocks, cover)
still override step() for the pre-step part only.
Document the step_base → domain_specific_step → get_observation flow,
_skip_domain_specific_dynamics flag, and _domain_specific_step as an
optional override.
Replace direct access to private _skip_domain_specific_dynamics
attribute with a public constructor parameter, so callers declare
kinematics-only mode at creation time instead of mutating internal
state after construction.
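
The step split and the kinematics-only switch can be pictured with a toy base class. This is a sketch only: `SketchEnv` and `FillEnv` are hypothetical stand-ins, and the constructor parameter name is an assumption rather than the actual PyBulletEnv signature.

```python
class SketchEnv:
    """Toy stand-in for a PyBulletEnv-style base class (illustrative names)."""

    def __init__(self, skip_domain_specific_dynamics: bool = False) -> None:
        # Declared once at construction instead of being mutated afterwards.
        self._skip_domain_specific_dynamics = skip_domain_specific_dynamics
        self.robot_x = 0.0

    def step(self, action: float) -> dict:
        self._step_base(action)  # robot control, physics, grasp
        if not self._skip_domain_specific_dynamics:
            self._domain_specific_step()  # water filling, heating, ...
        return self._get_observation()

    def _step_base(self, action: float) -> None:
        self.robot_x += action

    def _domain_specific_step(self) -> None:
        """Optional override; the default does nothing."""

    def _get_observation(self) -> dict:
        return {"robot_x": self.robot_x}


class FillEnv(SketchEnv):
    """Domain env that only overrides the process-dynamics hook."""

    def __init__(self, skip_domain_specific_dynamics: bool = False) -> None:
        super().__init__(skip_domain_specific_dynamics)
        self.water = 0.0

    def _domain_specific_step(self) -> None:
        self.water += 1.0

    def _get_observation(self) -> dict:
        obs = super()._get_observation()
        obs["water"] = self.water
        return obs


full = FillEnv()
kin_only = FillEnv(skip_domain_specific_dynamics=True)
for _ in range(2):
    full_obs = full.step(0.5)
    kin_obs = kin_only.step(0.5)
```

Both envs move the robot identically; only the full env advances the water level, which is the behaviour a kinematics-only oracle needs.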
…ging

Both AgentSessionMixin and AgentExplorer had near-identical wrappers that
ran session.query() synchronously via nest_asyncio or asyncio.run. Move
that logic into a module-level run_query_sync helper in session_manager
and have both callers delegate to it.
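
A minimal sketch of such a synchronous wrapper, using only the standard library. Note the swap: where the commit falls back to nest_asyncio inside a running loop, this sketch runs the coroutine on a worker thread instead. `run_coroutine_sync` and `fake_query` are hypothetical names, not the session_manager API.

```python
import asyncio
import concurrent.futures
from typing import Any, Awaitable, Callable


def run_coroutine_sync(coro_fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
    """Run an async function to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro_fn(*args))
    # A loop is already running here; use a fresh loop on a worker thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro_fn(*args)).result()


async def fake_query(message: str) -> str:
    """Stand-in for an agent session query."""
    await asyncio.sleep(0)
    return f"echo: {message}"


reply = run_coroutine_sync(fake_query, "hi")
```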
Distinguishes the grounded-plan explorer from upcoming bilevel variants.
AgentExplorer -> AgentPlanExplorer, get_name() 'agent' -> 'agent_plan',
file moved to agent_plan_explorer.py, and all callers / docstrings /
YAML config examples updated accordingly.
The mixin is pure agent-session plumbing (session creation, lifecycle,
explorer factory) and has no approach-specific logic, so it belongs
next to session_manager.py, tools.py, and the sandbox managers rather
than in approaches/.
The explorer asks a Claude agent for a plan sketch, refines it against
the approach's current (possibly learned) option model, and rolls the
refined plan out in the real env. When the mental model disagrees with
reality — e.g. the sketch expects JugFilled after a Wait but the mental
model's process dynamics can't produce it — the explorer truncates the
plan at the deepest unsatisfiable subgoal (inclusive) so the real-env
rollout ends exactly where the disagreement occurs, maximising signal
per experiment.
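
The truncation rule can be sketched with a small tracker that a backtracking search calls on every failed step. The class and method names below are illustrative, not the ones in bilevel_sketch.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeepestFailureTracker:
    """Remembers the attempt that failed at the deepest step index."""

    depth: int = -1
    prefix: List[str] = field(default_factory=list)

    def on_step_fail(self, step_idx: int, attempted_plan: List[str]) -> None:
        if step_idx > self.depth:
            self.depth = step_idx
            # Inclusive cut: keep the failing step so the real-env rollout
            # ends exactly where model and sketch disagree.
            self.prefix = list(attempted_plan[: step_idx + 1])


tracker = DeepestFailureTracker()
sketch = ["PickJug", "PlaceUnderFaucet", "SwitchFaucetOn", "Wait", "PickJug2"]

# Failures reported while backtracking, in the order they happen.
tracker.on_step_fail(1, sketch)
tracker.on_step_fail(3, sketch)  # e.g. JugFilled does not hold after Wait
tracker.on_step_fail(2, sketch)  # shallower than the best seen, so ignored

experiment_plan = tracker.prefix
```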

Key pieces:

- predicators/agent_sdk/bilevel_sketch.py: extracted the sketch build
  / parse / refine helpers from AgentBilevelApproach as module-level
  functions so both the approach (solve path) and the new explorer
  (exploration path) can share them. refine_sketch gains
  truncate_on_subgoal_fail: the on_step_fail callback snapshots the
  deepest subgoal failure seen during backtracking, and on exhaustion
  the captured prefix is returned as the experiment plan.

- predicators/explorers/agent_bilevel_explorer.py: new explorer.
  Reads option_model from tool_context (synced by the approach),
  builds the sketch prompt via bilevel_sketch, runs refine_sketch with
  check_subgoals=True, check_final_goal=False, truncate_on_subgoal_fail
  =True, wraps the result in an option_plan_to_policy that converts
  OptionExecutionFailure into RequestActPolicyFailure so the episode
  cleanly terminates at the point of real-env divergence. Stashes the
  sketch subgoals/options on ToolContext for downstream diffing by
  the learning approach.

- predicators/approaches/agent_bilevel_approach.py: shim methods over
  bilevel_sketch; behaviour unchanged.

- predicators/approaches/agent_planner_approach.py: _create_explorer
  dispatches both "agent_plan" and "agent_bilevel" through the agent
  factory path and forwards CFG.explorer as the name.

- predicators/explorers/__init__.py: factory branch merged for the
  two agent-session-backed explorers.

- predicators/agent_sdk/tools.py: ToolContext gains
  last_sketch_subgoals / last_sketch_options fields, populated by the
  explorer and marked TODO for the learning approach to consume.

- tests/explorers/test_agent_bilevel_explorer.py: happy-path, fallback,
  wait-memory-injection, and deepest-subgoal-failure truncation tests.
- New setting agent_bilevel_explorer_max_samples_per_step (default 50),
  separate from the solve-path budget, so the explorer's backtracking
  cost is independently tunable.
- Log the actual experiment plan (option names, objects, params) after
  refinement so the explorer's output is visible alongside the
  existing sketch/truncation log lines.
- Test config updated to set both budgets explicitly.
AgentSimLearningApproach extends AgentBilevelApproach to learn process
dynamics online. Each cycle: the agent synthesizes parameterized
process rules via Claude (using run_python / evaluate_simulator /
test_simulator MCP tools), parameters are fitted via emcee MCMC, and
the learned dynamics are composed with a kinematics-only PyBullet
oracle into a combined option model for plan refinement.

Key pieces:
- predicators/approaches/agent_sim_learning_approach.py: the approach.
  Initialises with a kinematics-only option model (so
  AgentBilevelExplorer sees disagreements at process-dynamic subgoals
  like JugFilled/Boiled), and replaces it with the kin+learned model
  after each successful synthesis cycle.
- predicators/agent_sdk/tools.py: create_synthesis_tools() builds the
  three MCP tools the synthesis agent uses; extra_mcp_tools field and
  get_allowed_tool_list(extra_names=) plumbing lets the approach
  inject them into the session.
- predicators/code_sim_learning/: ParamSpec, fit_params (emcee MCMC),
  compute_mse, LearnedSimulator.
- predicators/ground_truth_models/boil/gt_simulator.py: ground-truth
  process-dynamics simulator for the boil environment.
- tests/: approach and param-fitting tests.
- agents.yaml: comment out agent_bilevel preset, add agent_sim_learning
  with explorer=agent_bilevel and skip_test_until_last_ite_or_early_stopping.
- common.yaml: disable failure/test video recording, set
  num_online_learning_cycles=1 for faster iteration.
Simulation primitives (code_sim_learning/utils.py):
- apply_rules(state, rules, params) → ProcessUpdate
- merge_updates(base_state, updates, process_features) → State
- simulate_step(state, action, base_env, rules, params, features) → State
These replace _build_fitted_step_fn, merge_process_updates,
_sim_fn_from_rules, and the body of _build_combined_simulator.
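
How the three primitives compose can be sketched with plain dictionaries. Everything below is an assumption made for illustration: states are nested dicts rather than State objects, rules take `(state, params)` rather than the project's real rule signature, and `base_step` stands in for the kinematics-only env.

```python
from typing import Callable, Dict, List, Set

State = Dict[str, Dict[str, float]]  # object name -> feature -> value
ProcessUpdate = Dict[str, Dict[str, float]]
Params = Dict[str, float]
Rule = Callable[[State, Params], ProcessUpdate]


def apply_rules(state: State, rules: List[Rule], params: Params) -> ProcessUpdate:
    """Collect the feature updates proposed by every rule."""
    update: ProcessUpdate = {}
    for rule in rules:
        for obj, feats in rule(state, params).items():
            update.setdefault(obj, {}).update(feats)
    return update


def merge_updates(base_state: State, updates: ProcessUpdate,
                  process_features: Set[str]) -> State:
    """Overlay rule outputs on the base state, process features only."""
    merged = {obj: dict(feats) for obj, feats in base_state.items()}
    for obj, feats in updates.items():
        for name, value in feats.items():
            if name in process_features:
                merged.setdefault(obj, {})[name] = value
    return merged


def simulate_step(state: State, action: float,
                  base_step: Callable[[State, float], State],
                  rules: List[Rule], params: Params,
                  process_features: Set[str]) -> State:
    """One combined step: kinematics from the base env, processes from rules."""
    update = apply_rules(state, rules, params)
    return merge_updates(base_step(state, action), update, process_features)


def fill_rule(state: State, params: Params) -> ProcessUpdate:
    # Also emits a kinematic feature, which merge_updates must ignore.
    return {"jug": {"water": state["jug"]["water"] + params["rate"], "x": 99.0}}


def base_step(state: State, action: float) -> State:
    nxt = {obj: dict(feats) for obj, feats in state.items()}
    nxt["robot"]["x"] += action
    return nxt


s0: State = {"robot": {"x": 0.0}, "jug": {"x": 2.0, "water": 0.0}}
s1 = simulate_step(s0, 1.0, base_step, [fill_rule], {"rate": 0.25}, {"water"})
```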

GT simulator factory (ground_truth_models):
- GroundTruthSimulatorFactory ABC + get_gt_simulator(env_name) discovery,
  following the existing get_gt_options / get_gt_nsrts pattern.
- PyBulletBoilGroundTruthSimulatorFactory registered in boil/.
- Replaces the hardcoded _load_oracle_simulator in the approach.

Oracle ablation flags (settings.py):
- agent_sim_learn_oracle_sim_program: load GT rules, skip synthesis.
- agent_sim_learn_oracle_sim_params: use GT param values, skip MCMC.

Also: kin_env → base_env rename throughout, redundant self._types
assignment removed, process_features computed once in __init__.
- yapf + isort autoformatting applied to all touched files.
- pylint: fix logging-not-lazy in agent_bilevel_explorer, add
  broad-except and reimported disables in agent_sim_learning_approach.
- mypy: fix base/env variable name collision, add type: ignore on
  lambda inference, add return type annotations to GT factory methods.
Use utils.abstract to evaluate expected atoms in low-level search so
that DerivedPredicates — which require a Set[GroundAtom] rather than a
State — are handled correctly alongside regular predicates.
When sequential simulate calls differ only in process features (as in
the combined kinematic+learned simulator), reapplying joint positions
and tearing down/recreating grasp constraints causes visible arm
jitter. Compare robot poses first and skip the kinematic reset path
when they already match.
Factor simulator synthesis into a shared _learn_simulator helper so
that both learn_from_offline_dataset and learn_from_interaction_results
can trigger it on their respective trajectory sources. Also create a
separate headless env for parameter fitting so MCMC's thousands of
_set_state calls don't thrash the GUI env during training.
…ate_invention

Renames the recurrent partial-observability predicate-invention approach
file and its class (AgentSimRecurrentPredicateInventionApproach ->
AgentPOSimPredicateInventionApproach), updating all references across
settings, structs, agent_bilevel, utils, the predicatorv3 agents config,
and tests.
The synthesis tools (evaluate_step_fit, report_residuals) scored rules
through the legacy per-transition path (apply_rules, 3 args), while the
fitting engine calls recurrent rules with 5 args (apply_rules_with_latent
via has_latent_rules dispatch). So when the agent wrote the correct
5-arg signature the tool rejected it and steered the agent to a broken
3-arg rule, which then crashed the engine ("takes 3 positional
arguments but 5 were given").

- Add rollout_predictions() and route both tools through has_latent_rules
  dispatch: recurrent rules now score with the latent threaded per
  trajectory via the shared _fit_parameters_latent / compute_sse_recurrent
  path the engine uses. _snapshot_and_load now surfaces LATENT_INIT.
- Remove a duplicated synthesis-prompt block (bad-merge artifact that also
  double-injected the recurrent section) and template the rule-signature
  example: fully-observable keeps the 3-arg form, the PO subclass shows
  only the recurrent 5-arg signature (no 3-arg references).
- Add tests for rollout_predictions and FO/PO prompt rendering.
The (roll, tilt, wrist) Euler triple jointly encodes a free SO(3)
orientation, so an axis-by-axis state-reconstruction check is degenerate
at gimbal lock (tilt=±π/2): equivalent gimbal branches report up to π of
spurious per-axis error on the same physical orientation, which surfaced
as noisy "Could not reconstruct state exactly" warnings on robot.roll /
robot.wrist.

Add _ORIENTATION_EULER_TRIPLES and _euler_orientation_angle (geodesic
angle between unit quaternions) and compare the triple as a single
rotation, excluding its axes from the per-axis pass. The residual now
surfaces as one small <orientation> angle instead of misleading per-axis
rows. Adds gimbal-lock tests.
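
To see why a single rotation angle is the right comparison, here is a small sketch of the geodesic angle between two Euler triples. It assumes the common ZYX (yaw-pitch-roll) convention with tilt playing the role of pitch; the env's actual (roll, tilt, wrist) convention may differ, and the function names are illustrative.

```python
import math
from typing import Tuple

Quat = Tuple[float, float, float, float]  # (w, x, y, z)
Euler = Tuple[float, float, float]  # (roll, pitch, yaw)


def euler_to_quat(roll: float, pitch: float, yaw: float) -> Quat:
    """Unit quaternion for R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (
        cr * cp * cy + sr * sp * sy,
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
    )


def orientation_angle(e1: Euler, e2: Euler) -> float:
    """Geodesic angle in radians between the two orientations."""
    q1, q2 = euler_to_quat(*e1), euler_to_quat(*e2)
    # abs() because q and -q encode the same rotation; min() guards acos.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


# Two gimbal-lock branches of the same physical orientation.
branch_a = (1.0, math.pi / 2, 0.0)
branch_b = (0.0, math.pi / 2, -1.0)

per_axis_error = max(abs(a - b) for a, b in zip(branch_a, branch_b))
angle = orientation_angle(branch_a, branch_b)
```

Here the per-axis comparison reports an error of 1 rad on both roll and yaw, while the geodesic angle is zero up to floating-point noise.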
Large MCP tool results returned inline were truncated by the agent SDK and
dumped to ~/.claude/projects/.../tool-results/ (outside the sandbox), then
the agent was instructed to read that host path -- the one out-of-sandbox
access observed in the boil predicate-invention runs.

- Add _make_spilling_text_result and route all three tool factories through
  it: results over ~30k chars now spill to <sandbox>/tool_outputs/ with a
  head/tail preview, so nothing is dumped outside the sandbox. inspect_*
  (create_mcp_tools) previously had no spill; run_python already did.
- Add _screen_text_for_sandbox_escape and a matching self-contained Bash
  screen in VALIDATE_SANDBOX_SCRIPT (matcher now includes Bash): reject
  absolute / .. paths resolving outside the sandbox and predicators-source
  introspection. run_python is screened in-tool (the file-path hook does not
  cover MCP tools); Bash is screened by the hook.

Heuristic, not a hard boundary (subprocess/env/computed paths can still
escape; OS isolation remains the real boundary). Verified against all 64
historical tool calls in the logs: only the 3 seed3 leak reads are blocked,
zero false positives on legitimate calls.
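
The spill behaviour can be sketched as below. The helper name, the default limit, and the preview format are illustrative; only the idea (write large results under the sandbox's tool_outputs/ directory and return a head/tail preview) comes from the commit.

```python
import os
import tempfile


def spill_text_result(text: str, sandbox_dir: str, name: str,
                      limit: int = 30_000, preview: int = 200) -> str:
    """Return text inline, or spill it into the sandbox and return a preview."""
    if len(text) <= limit:
        return text
    out_dir = os.path.join(sandbox_dir, "tool_outputs")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        f.write(text)
    # Report a sandbox-relative path so the agent never sees a host path.
    return (f"[{len(text)} chars, full output in tool_outputs/{name}]\n"
            f"{text[:preview]}\n...\n{text[-preview:]}")


with tempfile.TemporaryDirectory() as sandbox:
    small = spill_text_result("ok", sandbox, "small.txt")
    big_text = "A" * 20_000 + "B" * 20_000
    big = spill_text_result(big_text, sandbox, "big.txt")
    spilled_path = os.path.join(sandbox, "tool_outputs", "big.txt")
    with open(spilled_path, encoding="utf-8") as f:
        spilled_len = len(f.read())
    small_was_spilled = os.path.exists(
        os.path.join(sandbox, "tool_outputs", "small.txt"))
```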
The 'Refinement vs. forward validation' pitfall examples in the synthesis
system prompt named heat_level, the heat rule, jug-to-burner gating, and
WaterBoiled — leaking the pybullet_boil latent's name and causal structure
to the agent during model synthesis. Rewrite both using the generic
widget/fixture/WidgetReady/process_value vocabulary already used elsewhere
in the prompt, preserving the lessons unchanged.
During bilevel refinement the option model backtracks by resetting the
PyBullet env to a search node's state. Features derived from a hidden
sim-feature (e.g. bubbling_level read out from heat_level) cannot be
reconstructed from an observation-only State, so they come back at their
default (0). A learned rule that reads its own emitted observable back as
input (a latch) then silently loses state, making otherwise-valid plans
unrefinable — even though a continuous forward rollout works.

PyBulletEnv._set_state now records the (object, feature) pairs it could
not round-trip (_last_unreconstructible_features, via a structured
_reconstruction_mismatch_features helper); it is cleared on sequential
rollouts where no reset happens. The agent-sim combined simulators call a
new _restore_unreconstructible_process_features that overwrites exactly
those features (intersected with the declared PROCESS_FEATURES) with the
carried value before the rules run. Scoping to the env-reported lossy set
leaves base-reconstructible co-owned features (e.g. a robot-movable,
wind-blown x,y) untouched, so this does not freeze them.
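
The scoping rule (overwrite only features that are both env-reported as lossy and declared as process features) can be sketched as follows. States are plain nested dicts here and the function name is illustrative, not the helper added in this commit.

```python
from typing import Dict, Set, Tuple

State = Dict[str, Dict[str, float]]  # object name -> feature -> value


def restore_lossy_process_features(
    reset_state: State,
    carried_state: State,
    lossy_pairs: Set[Tuple[str, str]],
    process_features: Set[str],
) -> State:
    """Copy carried values back for lossy process features only."""
    restored = {obj: dict(feats) for obj, feats in reset_state.items()}
    for obj, feat in lossy_pairs:
        if feat not in process_features:
            continue
        if obj in carried_state and feat in carried_state[obj]:
            restored.setdefault(obj, {})[feat] = carried_state[obj][feat]
    return restored


# State the search node carried vs. what the env reports after a reset.
carried = {"jug": {"x": 1.0, "bubbling_level": 0.8}}
after_reset = {"jug": {"x": 1.2, "bubbling_level": 0.0}}

fixed = restore_lossy_process_features(
    after_reset,
    carried,
    lossy_pairs={("jug", "bubbling_level")},  # x round-tripped fine
    process_features={"x", "bubbling_level"},  # x is co-owned
)
```

Because x is not in the lossy set, the value reconstructed by the base env wins even though x is declared as a process feature.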
Tell the synthesis agent to keep any state carried across steps (counters,
accumulated levels, irreversible "done" flags) in the threaded `latent`
block, and to treat emitted observables as outputs only — recomputed from
`latent` each step, never read back as input. Only `latent` is guaranteed
to survive the planner's state resets during refinement, so a rule that
latches on its own emitted feature passes a step-by-step rollout yet
breaks at refinement time. Kept general (no env-specific names) and points
at the existing Pattern A/B examples, which already follow it.
The agent_bilevel explorer previously refined with check_final_goal=False
and reported "solved" purely from real-env execution, so a learned model
that produces an executable plan but mispredicts the goal could trigger
early stopping despite being unable to plan to the goal in its own model.

Now the explorer refines with check_final_goal=True and records whether
the mental model reached the task goal. refine_sketch's
truncate_on_subgoal_fail additionally captures a final-goal failure
(renamed deepest_subgoal_fail_* -> deepest_fail_*), so a goal the model
predicts won't hold still runs end-to-end in reality as an experiment
rather than being dropped. The verdict rides ToolContext to
get_interaction_requests, which stamps InteractionRequest.mental_model_solved;
main._generate_interaction_results treats a False verdict as not-solved
for early stopping (None = no verdict, so other explorers are unchanged).
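
The tri-state verdict amounts to the following check. The function name is hypothetical; the None/True/False semantics follow the description above.

```python
from typing import Optional


def counts_as_solved(real_env_solved: bool,
                     mental_model_solved: Optional[bool]) -> bool:
    """Early-stopping test: a False verdict vetoes a real-env success."""
    # None means the explorer issued no verdict, so behaviour is unchanged.
    return real_env_solved and mental_model_solved is not False


legacy_explorer = counts_as_solved(True, None)
model_agrees = counts_as_solved(True, True)
model_mispredicts_goal = counts_as_solved(True, False)
```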
Replace the pybullet_boil/`heat_level` examples in the State.data and
State.latent docstrings with environment-agnostic wording, matching the
existing effort to keep core structs free of boil-specific leakage.
The switch envs define "fully on" as joint_scale * jointUpperLimit (~10% of
the joint's URDF travel) but leave the prismatic joint free, so a gripper
push can over-extend the slider into the remaining travel. From there the
reverse push can no longer drag it back across the on/off threshold -- e.g.
in boil, SwitchBurnerOn over-pushes the switch to frac~1.5 and the later
SwitchBurnerOff then fails to turn it off, leaving BurnerOff unsatisfied.
Forward-validation masked this because the switch is excluded from the
observable state and reconstruction resets snap the joint back to the
canonical on-position (frac=1.0), from which the off-push works.

Add cap_switch_joint_travel (pybullet_helpers/objects.py): a changeDynamics
upper limit at joint_scale * jointUpperLimit so "fully on" coincides with the
joint's physical stop. changeDynamics is invisible to getJointInfo, so each
env's frac readout (on=1.0 / off=0.0 / threshold=0.5) is unchanged -- only
the unreachable over-extension headroom is removed. It is a no-op for
switches that are only toggled programmatically.

Applied at switch creation in boil, laser, switch, magic_bin, barrier, and
fan (fan's setJointMotorControl2 drives the fan blades, not the switches).
Give every PyBullet env a "studio room" look -- muted floor, warm backdrop
walls, wood table texture, a directional key light with contact shadows, and
a neutral GUI background -- instead of the flat default scene. The backdrop
room and key-light direction are derived from each env's camera, so the look
adapts automatically; an env can override any piece via class vars or opt out
with _use_studio_visuals = False.

It is applied through the base PyBulletEnv (initialize_pybullet / render /
__init__), so every env using the shared setup gets it; only domino needed its
two-table initialize_pybullet updated (now via super()). The rendering
machinery lives in a new pybullet_helpers/studio_visuals.py module, leaving the
env classes with just the per-env-overridable studio config. Wall textures are
generated by scripts/generate_room_textures.py.
Two CFG knobs let agent_planner run as a model-free or base-sim
baseline against the world-model learner:

- agent_planner_use_simulator (default True): when False, the planner
  gets no option model, so test_option_plan and the scene-rendering
  tools (visualize_state/annotate_scene) are withheld and the prompt
  shifts to open-loop framing -- it must plan from trajectory data and
  LLM reasoning alone.
- agent_planner_use_base_simulator (default False): when a simulator is
  used, wraps the base env (skip_process_dynamics=True) instead of the
  real one, so the planner's model lacks the delayed
  _domain_specific_step dynamics.

create_option_model gains a skip_process_dynamics passthrough (forwarded
only when True, so non-PyBullet analog envs are unaffected).
docker_agent_runner honors the base-sim flag on its in-container
rebuild. agent_bilevel asserts a non-None option model. Defaults
reproduce existing behavior.
docformatter 1.4 wanted re-wraps of the genericized latent docstrings in
structs.py/utils.py. mypy flagged AgentAbstractionLearningApproach because
AgentPlannerApproach now types _option_model as Optional (it genuinely can
be None on the model-free path) while BilevelPlanningApproach types it
non-Optional; suppress the unavoidable diamond-merge [misc] error.
run_refinement_for_synthesis (backing the evaluate_plan_refinement tool)
was left on the fully-observable 3-arg path when PO/recurrent support was
added. A 5-arg latent-declaring rule was therefore fit via the legacy
per-transition fitter and rolled through a combined simulator built from
stale self._process_rules -- both calling the rule with 3 args, which
pushed synthesized rules into defensive dual-convention boilerplate.

Dispatch the fit on has_latent_rules (recurrent fit for latent rules),
and publish the candidate rules/latent_init onto the approach before
building the combined simulator so it validates the candidate rules with
the 5-arg convention. Thread latent_init through the tool wrapper instead
of discarding it. This matches the signature-based dispatch every other
call site already uses.
Centralize the faucet-outlet computation (used by both the JugAtFaucet
fill check and the spill block) into a shared _faucet_outlet_xy helper
that uses outlet = faucet + R(rot) @ (local_dx, local_dy) -- the same
rotation-matrix parameterization the learned simulators use -- instead
of the duplicated single-distance-along-(cos, -sin) special case.

Behavior-identical at the faucet's fixed rot=pi/2 (outlet stays at the
true (faucet_x, faucet_y - faucet_x_len)); the general form lets the
env's true model sit inside the learner's hypothesis class.
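
A worked instance of the general form, assuming the standard counterclockwise 2D rotation matrix (the env's sign convention is not shown here, so treat the local offsets as illustrative):

```python
import math
from typing import Tuple


def faucet_outlet_xy(faucet_x: float, faucet_y: float, rot: float,
                     local_dx: float, local_dy: float) -> Tuple[float, float]:
    """outlet = faucet + R(rot) @ (local_dx, local_dy), R counterclockwise."""
    c, s = math.cos(rot), math.sin(rot)
    return (faucet_x + c * local_dx - s * local_dy,
            faucet_y + s * local_dx + c * local_dy)


# Hypothetical numbers: a faucet at (1.0, 2.0) with a 0.3-long spout.
faucet_x, faucet_y, faucet_x_len = 1.0, 2.0, 0.3

# Under this convention, the fixed rot = pi/2 case reproduces the old outlet
# with a negative local x offset.
outlet = faucet_outlet_xy(faucet_x, faucet_y, math.pi / 2, -faucet_x_len, 0.0)
expected = (faucet_x, faucet_y - faucet_x_len)
```

Under this convention the matching local offset is negative, which is exactly the kind of signed parameter a strictly positive constraint cannot represent.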
The recurrent (partially-observable) fitter previously had no LM path and
hardcoded a theta>0 constraint in emcee, so signed parameters (e.g. a
faucet local-frame offset whose true value is negative) could not be
represented, and with MCMC disabled nothing fit the params at all.

Two changes:

* Replace the blanket theta>0 in both emcee fitters with each ParamSpec's
  declared [lo, hi] box (factored into a shared _param_bounds helper the LM
  path also uses), and make the Gaussian prior width robust to negative /
  zero inits (_prior_widths uses |init| with a bound-range fallback).
  Signed parameters that declare a negative lo are now fittable.

* Add compute_residuals_recurrent (rollout residual vector, fixed obj x feat
  order so the Jacobian stays well-formed across hard-gate flips;
  sum(residuals**2) == compute_sse_recurrent by construction) and
  fit_map_lm_recurrent, then wire the LM warm-start / Hessian identifiability
  diagnostic into fit_params_recurrent behind the same CFG flags as the FO
  path. With num_mcmc_steps=0 the recurrent path now returns the LM MAP
  instead of raw init, and the diagnostic surfaces hard-gated, data-flat
  parameters as unidentifiable rather than passing them through silently.
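
The first change can be sketched as a box-bounded Gaussian log-prior. The `ParamSpec` fields, the relative width of 0.5, and the exact fallback rule below are assumptions for illustration; the source only states that widths use |init| with a bound-range fallback.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ParamSpec:
    """Illustrative stand-in; the real ParamSpec may carry other fields."""
    name: str
    init: float
    lo: float
    hi: float


def param_bounds(specs: Sequence[ParamSpec]) -> Tuple[List[float], List[float]]:
    return [s.lo for s in specs], [s.hi for s in specs]


def prior_widths(specs: Sequence[ParamSpec], rel: float = 0.5) -> List[float]:
    """Width from |init|, falling back to the bound range when init is 0."""
    return [rel * abs(s.init) if s.init != 0.0 else rel * (s.hi - s.lo)
            for s in specs]


def log_prior(theta: Sequence[float], specs: Sequence[ParamSpec]) -> float:
    """Gaussian around init inside each [lo, hi] box, -inf outside it."""
    los, his = param_bounds(specs)
    total = 0.0
    for value, spec, lo, hi, width in zip(theta, specs, los, his,
                                          prior_widths(specs)):
        if not lo <= value <= hi:
            return -math.inf
        total += -0.5 * ((value - spec.init) / width) ** 2
    return total


specs = [
    ParamSpec("outlet_dx", init=-0.1, lo=-1.0, hi=1.0),  # signed parameter
    ParamSpec("gain", init=0.0, lo=0.0, hi=2.0),  # zero init
]
widths = prior_widths(specs)
inside = log_prior([-0.3, 0.5], specs)
outside = log_prior([-1.5, 0.5], specs)
```

With a blanket theta>0 constraint the first parameter could never take its negative value; with the declared box it is a valid point with finite prior density.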
Synthesis prompts assumed a single object per type, so the agent wrote
rules indexing jugs[0]/faucets[0] and a flat {"heat": 0.0} latent that
break with multiple same-type objects.

- Shared base prompt: add a 'Multiple objects of the same type' section
  telling rules to gather by type and loop over all bindings, never a
  fixed slot, with shared params across instances.
- Recurrent (PO) prompt: add a 'Structure the latent like the state'
  section — shape the latent object-first ({obj.name: {feature: value}})
  to mirror data, while keeping it a free-form name-keyed dict (not a
  typed array) and global latents as top-level scalars.
Fit entry points now hand back the FitResult itself (callers read
.point_estimate) and attach the LM Jacobian, noise sigma, and prior
sigma at the MAP whenever the LM prefit ran, so a Laplace posterior
covariance can be built without re-deriving it. Adds the
agent_explorer_info_* settings consumed by the prefit gate.
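
For reference, the textbook Gauss-Newton form of the Laplace covariance that these three quantities allow, assuming Gaussian observation noise with standard deviation $\sigma$ and independent Gaussian priors with per-parameter standard deviations $\sigma_{\text{prior},i}$ (whether the implementation uses exactly this form is an assumption):

```latex
\Sigma \;\approx\; \left( \frac{1}{\sigma^{2}}\, J^{\top} J
      \;+\; \operatorname{diag}\!\left( \sigma_{\text{prior},1}^{-2}, \dots,
      \sigma_{\text{prior},d}^{-2} \right) \right)^{-1},
\qquad
\theta^{(k)} \sim \mathcal{N}\!\left( \hat{\theta}_{\text{MAP}},\, \Sigma \right)
```

Here $J$ is the Jacobian of the residual vector with respect to the $d$ parameters, evaluated at $\hat{\theta}_{\text{MAP}}$, and each draw $\theta^{(k)}$ is one candidate ensemble member.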
New code_sim_learning.active_experiment module: build a parameter
ensemble from the fit (posterior subsample, Laplace draw from the LM
Jacobian, or uniform-jitter fallback) and score candidate states by
ensemble disagreement on subgoal atoms, turning continuous-parameter
search into information seeking.
refine_sketch accepts an optional info_scorer: subgoal-annotated steps
pool up to info_n_feasible_target feasible parameter samples within the
existing per-node rollout budget, propose them best-first by ensemble
disagreement, and replay the ranked remainder across backtracking
retries without new rollouts. No scorer means first-feasible search,
unchanged.
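
The pooling step can be sketched as follows, with hypothetical callbacks standing in for parameter sampling, option-model rollout, and the info scorer:

```python
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # one candidate parameter sample


def pool_and_rank(
    sample_fn: Callable[[], C],
    is_feasible: Callable[[C], bool],
    info_score: Callable[[C], float],
    n_feasible_target: int,
    rollout_budget: int,
) -> Tuple[List[C], int]:
    """Pool feasible candidates within the budget, most informative first."""
    pool: List[C] = []
    rollouts = 0
    while rollouts < rollout_budget and len(pool) < n_feasible_target:
        cand = sample_fn()
        rollouts += 1  # every feasibility check costs one rollout
        if is_feasible(cand):
            pool.append(cand)
    ranked = sorted(pool, key=info_score, reverse=True)
    return ranked, rollouts


# Toy setup: negative samples are infeasible; samples near 0.5 are the
# ones the ensemble disagrees on most.
samples = iter([-1.0, 0.2, 0.9, -0.3, 0.5, 0.7])

ranked, used = pool_and_rank(
    sample_fn=lambda: next(samples),
    is_feasible=lambda c: c >= 0.0,
    info_score=lambda c: -abs(c - 0.5),
    n_feasible_target=3,
    rollout_budget=5,
)
proposal, retry_stock = ranked[0], ranked[1:]
```

The retry stock is consumed on later backtracking visits to the same step, so those retries cost no additional rollouts.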
The sim-learning approach builds a calibrated parameter ensemble after
each fit (posterior subsample > Laplace > uniform jitter) and exposes
score_atom_disagreement; the planner syncs it into the tool context and
the agent_bilevel explorer hands it to refinement as the info scorer,
with experiment guidance naming the most-disagreed predicates in the
explore prompt. Enable via agent_explorer_info_seeking in the renamed
agent_po_predicate_invention_al experiment.
All five failed boil test episodes (AL seed0/seed1) shared one mode:
the real Place drop-settle landed the jug outside the burner-align
radius while the option-model rollout predicted on-target, and the
open-loop plan then burned the 500-step horizon waiting for a boil
that could not happen. Forward validation only proves a plan works in
the option model, so divergence has to be caught at execution time.

With agent_bilevel_max_execution_replans > 0, test execution is now
closed-loop, built on the repo's standard cogman monitoring framework:

- A new subgoal_annotations execution monitor checks the just-finished
  step's sketch annotation at the exact option boundary (it evaluates
  the live option's terminal condition itself, so detection is not one
  env step late) and suggests replanning on divergence.
- AgentBilevelApproach exports a live SubgoalExecutionStatus via the
  existing get_execution_monitoring_info hook; the dispensed policy
  just executes and reports progress. CogMan's standard replan path
  re-invokes solve(), which lands in _maybe_replan_from_divergence:
  it resumes a re-refined suffix of the executed sketch from the
  current state (walking back from the failed step, bounded by the
  latest still-holding annotation, each candidate forward-validated),
  and only falls back to a fresh agent sketch when no suffix
  validates.
- A new BaseApproach.reset_for_new_episode hook, called from
  CogMan.reset, distinguishes the episode-start solve from mid-episode
  re-solves and keeps the recovery budget — shared across chained
  replans — as a plain per-episode instance counter. Once the budget
  is exhausted, the next divergence raises ApproachFailure so the
  episode fails fast instead of burning the horizon open-loop.
- Construction-time check: enabling the replan budget without
  --execution_monitor subgoal_annotations is a config error, since
  detection lives in the monitor.
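
The per-episode budget accounting can be sketched in isolation. `ReplanBudget` and its methods are hypothetical names, and the local `ApproachFailure` only mirrors the exception named above.

```python
class ApproachFailure(Exception):
    """Local stand-in for the project's ApproachFailure exception."""


class ReplanBudget:
    """Per-episode counter shared across chained replans."""

    def __init__(self, max_execution_replans: int) -> None:
        self._max = max_execution_replans
        self._used = 0

    def reset_for_new_episode(self) -> None:
        self._used = 0

    def on_divergence(self) -> None:
        """Spend one replan, or fail fast when none are left."""
        if self._used >= self._max:
            raise ApproachFailure("execution replan budget exhausted")
        self._used += 1


budget = ReplanBudget(max_execution_replans=2)
budget.reset_for_new_episode()
budget.on_divergence()  # first replan
budget.on_divergence()  # second replan, chained after the first

try:
    budget.on_divergence()  # third divergence: no budget left
    failed_fast = False
except ApproachFailure:
    failed_fast = True

budget.reset_for_new_episode()
budget.on_divergence()  # a new episode starts with a full budget
```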
Subgoal annotations became runtime contracts with the execution-replan
change: refinement validates each annotated step, execution monitoring
checks them against the real state, and suffix replanning anchors its
walk-back on them — all blind at unannotated steps. Update the sketch
prompts to ask for an annotation on every expressible step (preferring
atoms that newly change, since already-true atoms cannot reveal
divergence), give predicate invention an effect-coverage objective (an
unannotatable step signals a missing predicate), and log per-sketch
annotation coverage when parsing.
# Conflicts:
#	predicators/agent_sdk/bilevel_sketch.py
#	predicators/agent_sdk/tools.py
#	predicators/approaches/agent_po_sim_predicate_invention_approach.py
#	predicators/approaches/agent_sim_learning_approach.py
#	predicators/explorers/agent_bilevel_explorer.py
#	scripts/configs/predicatorv3/agents.yaml
#	tests/code_sim_learning/test_training.py
@yichao-liang yichao-liang merged commit bd3fbf9 into master Jun 12, 2026
14 checks passed
