fix: loop substrate hardening — worktree self-heal + prod handoff allowlist #5

Merged
Svaag merged 2 commits into main from fix/worktree-setup-self-heal on Jun 15, 2026

@Svaag Svaag commented Jun 15, 2026

Context

Two substrate bugs that each blocked the production docs-only canary (AS215932/network-operations#235) on the dedicated loop VM, surfaced while finishing the Phase A loop cutover (after engineering-loop#4 provider-env and network-operations#237 memory-location fixes). Both must land before the hourly timer is enabled.

1. Self-heal stale worktrees (promotion.py)

setup_worktrees_for_state aborted with "worktree path already exists" whenever a worktree, or just its branch, from a prior crashed run was still on disk. The per-invocation rollback only removes worktrees created in the same call, so a single crash permanently wedged every future run for that change_id. That is fatal once the timer runs hourly; the leftover had to be hand-cleaned on loop to unblock the canary.

  • Clean any leftover worktree + branch (no-op on a clean tree) and git worktree prune before recreating, instead of raising.
  • _worktree_branch_registered detects a leftover branch even when the directory is gone (that also breaks worktree add -b).
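
The cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the code in promotion.py: `git_runner`, `branch_exists`, and `heal_stale_worktree` are hypothetical names, and `branch_exists` is only a stand-in for whatever `_worktree_branch_registered` actually checks. The runner is injected so the logic can be exercised without a real repository.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# A runner takes git arguments and returns the completed process.
Runner = Callable[[Sequence[str]], "subprocess.CompletedProcess[str]"]


def git_runner(repo: Path) -> Runner:
    """Build a runner that executes git inside `repo` and never raises on failure."""
    def run(args: Sequence[str]) -> "subprocess.CompletedProcess[str]":
        return subprocess.run(
            ["git", "-C", str(repo), *args],
            capture_output=True, text=True, check=False,
        )
    return run


def branch_exists(run: Runner, branch: str) -> bool:
    """Hypothetical stand-in: true when a local branch ref exists,
    regardless of whether its worktree directory is still on disk."""
    result = run(["show-ref", "--verify", "--quiet", f"refs/heads/{branch}"])
    return result.returncode == 0


def heal_stale_worktree(run: Runner, worktree: Path, branch: str) -> list[str]:
    """Remove leftovers from a crashed run; does almost nothing on a clean tree."""
    actions: list[str] = []
    if worktree.exists():
        result = run(["worktree", "remove", "--force", str(worktree)])
        if result.returncode != 0:
            # The directory is not a registered worktree; delete it directly.
            shutil.rmtree(worktree, ignore_errors=True)
        actions.append("removed worktree")
    # Prune first so git no longer treats the branch as checked out somewhere.
    run(["worktree", "prune"])
    actions.append("pruned")
    if branch_exists(run, branch):
        run(["branch", "-D", branch])
        actions.append("deleted branch")
    return actions
```

In this sketch the prune runs before the branch deletion because git refuses to delete a branch that a registered worktree still has checked out. A caller would run `heal_stale_worktree(git_runner(repo), path, branch)` immediately before `git worktree add -b`.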

2. Allowlist the production handoff dir (engineering-loop-policy.yml)

After worktree setup was unblocked, a fully executed run (≈6 min of real implementation) still dead-ended at needs_triage:

handoff output directory is not allowlisted: /var/lib/engineering-loop/runs/.../handoff

The daemon writes per-change handoffs under its --output-root (/var/lib/engineering-loop/runs), but the shipped gate policy only allowlisted /tmp (the CI/test default, TMPDIR=/tmp).

  • Add /var/lib/engineering-loop/runs to allowed_handoff_dirs (keep /tmp for CI).
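
To illustrate why the production path was rejected, here is a small sketch of a directory allowlist check. It is an assumption about how such a gate could work, not the gate's actual implementation: `is_handoff_dir_allowed` is a hypothetical helper, and the `example-change` path segment is made up for the example. Only the `allowed_handoff_dirs` key and the two directory roots come from this PR.

```python
import posixpath
from pathlib import PurePosixPath


def is_handoff_dir_allowed(handoff_dir: str, allowed_dirs: list[str]) -> bool:
    """Return True when handoff_dir is one of the allowed roots or lies below one."""
    # Collapse "." and ".." so "/tmp/../etc" cannot pass as a child of /tmp.
    candidate = PurePosixPath(posixpath.normpath(handoff_dir))
    return any(
        candidate.is_relative_to(PurePosixPath(posixpath.normpath(root)))
        for root in allowed_dirs
    )


# Policy values as described in this PR: /tmp for CI plus the daemon's output root.
policy = {"allowed_handoff_dirs": ["/tmp", "/var/lib/engineering-loop/runs"]}

prod_handoff = "/var/lib/engineering-loop/runs/example-change/handoff"
print(is_handoff_dir_allowed(prod_handoff, policy["allowed_handoff_dirs"]))
print(is_handoff_dir_allowed(prod_handoff, ["/tmp"]))
```

The comparison works on whole path components, so a sibling such as `/var/lib/engineering-loop/runs-other` does not match. The sketch does not resolve symlinks; a real gate would likely need to.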

Validation

  • uvx ruff check src tests: clean
  • uv run --group dev mypy --strict src: clean
  • uv run --group dev pytest -q: 161 passed (two new regression tests: worktree self-heal; shipped-policy prod handoff allowlist)

Rollout

Merge → bump engineering_loop_version to this SHA in network-operations host_vars/loop.yml → apply.yml to loop → re-run canary #235 (expect one docs-only draft PR) → only then enable the timer (plan A3).

setup_worktrees_for_state raised "worktree path already exists" when a
worktree (or its branch) from a prior *crashed* run was still on disk.
The per-invocation rollback only removes worktrees created in the same
call, so a single crash permanently wedged every future daemon run for
that change_id — fatal once the hourly timer is enabled.

Now clean any leftover worktree + branch (no-op on a clean tree) and
prune before recreating. Detect a leftover branch even when the worktree
directory is already gone, since that would also break `worktree add -b`.

Surfaced by the production docs-only canary (network-operations#235) on
the loop VM, which wedged on a worktree left by an earlier failed run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Svaag Svaag added the agentic-isp label (AS215932/Hyrule agentic ISP operating-loop work) Jun 15, 2026
The daemon on the loop VM writes per-change handoffs under its
--output-root (/var/lib/engineering-loop/runs), but the shipped gate
policy only allowlisted /tmp (the CI/test default). So a fully executed
run dead-ended at needs_triage with "handoff output directory is not
allowlisted" and never published a draft PR.

Add /var/lib/engineering-loop/runs to allowed_handoff_dirs (keeping /tmp
for CI). New test loads the shipped policy and asserts the production
handoff path is allowlisted while an off-allowlist path is still rejected.

Surfaced by the production docs-only canary (network-operations#235).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Svaag Svaag changed the title from "fix(promotion): self-heal stale worktrees instead of wedging the loop" to "fix: loop substrate hardening — worktree self-heal + prod handoff allowlist" Jun 15, 2026
@Svaag Svaag marked this pull request as ready for review June 15, 2026 14:22
@Svaag Svaag merged commit 58735fa into main Jun 15, 2026
3 checks passed