fix: loop substrate hardening — worktree self-heal + prod handoff allowlist#5
Merged
Conversation
setup_worktrees_for_state raised "worktree path already exists" when a worktree (or its branch) from a prior *crashed* run was still on disk. The per-invocation rollback only removes worktrees created in the same call, so a single crash permanently wedged every future daemon run for that change_id — fatal once the hourly timer is enabled. Now clean any leftover worktree + branch (no-op on a clean tree) and prune before recreating. Detect a leftover branch even when the worktree directory is already gone, since that would also break `worktree add -b`. Surfaced by the production docs-only canary (network-operations#235) on the loop VM, which wedged on a worktree left by an earlier failed run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The daemon on the loop VM writes per-change handoffs under its --output-root (/var/lib/engineering-loop/runs), but the shipped gate policy only allowlisted /tmp (the CI/test default). So a fully executed run dead-ended at needs_triage with "handoff output directory is not allowlisted" and never published a draft PR. Add /var/lib/engineering-loop/runs to allowed_handoff_dirs (keeping /tmp for CI). New test loads the shipped policy and asserts the production handoff path is allowlisted while an off-allowlist path is still rejected. Surfaced by the production docs-only canary (network-operations#235). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Two substrate bugs that each blocked the production docs-only canary (
AS215932/network-operations#235) on the dedicatedloopVM, surfaced while finishing the Phase A loop cutover (afterengineering-loop#4provider-env andnetwork-operations#237memory-location fixes). Both must land before the hourly timer is enabled.1. Self-heal stale worktrees (
promotion.py)setup_worktrees_for_stateaborted withworktree path already existswhenever a worktree — or just its branch — from a prior crashed run was still on disk. The per-invocation rollback only removes worktrees created in the same call, so a single crash permanently wedged every future run for thatchange_id. Fatal once the timer runs hourly (had to be hand-cleaned onloopto unblock the canary).git worktree prunebefore recreating, instead of raising._worktree_branch_registereddetects a leftover branch even when the directory is gone (that also breaksworktree add -b).2. Allowlist the production handoff dir (
engineering-loop-policy.yml)After worktree setup was unblocked, a fully executed run (≈6 min of real implementation) still dead-ended at
needs_triage:The daemon writes per-change handoffs under its
--output-root(/var/lib/engineering-loop/runs), but the shipped gate policy only allowlisted/tmp(the CI/test default,TMPDIR=/tmp)./var/lib/engineering-loop/runstoallowed_handoff_dirs(keep/tmpfor CI).Validation
uvx ruff check src tests— cleanuv run --group dev mypy --strict src— cleanuv run --group dev pytest -q— 161 passed (two new regression tests: worktree self-heal; shipped-policy prod handoff allowlist)Rollout
Merge → bump
engineering_loop_versionto this SHA innetwork-operationshost_vars/loop.yml→apply.ymltoloop→ re-run canary#235(expect one docs-only draft PR) → only then enable the timer (plan A3).