Skip to content

PR #67 follow-ups: run-lifecycle unification, heartbeat recovery, sys.exit deep fix, green CI#71

Merged
FlyM1ss merged 12 commits into
mainfrom
pr67-followups
Jul 5, 2026
Merged

PR #67 follow-ups: run-lifecycle unification, heartbeat recovery, sys.exit deep fix, green CI#71
FlyM1ss merged 12 commits into
mainfrom
pr67-followups

Conversation

@FlyM1ss

@FlyM1ss FlyM1ss commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator

Closes out every open follow-up recorded on docs/reviews/PR67-FIX-CHECKLIST.md from the PR #67 review cycle. PR #67 merged while this pass was in flight, so these land as their own PR (upstream main, including the #69/#70 landing refreshes, is merged in — the shipped bundle was rebuilt from the merged source so the brand changes and the dead-component prune ship together).

What's here

Run-lifecycle unification (H4 follow-ups) — v2 runs now write protocol_runs rows through their whole lifecycle, which yields three fixes with one mechanism:

  • Unified per-agent active-run cap: both surfaces count run_store.count_active_runs() under the shared v1 create lock — an agent can no longer hold 2× MAX_ACTIVE_RUNS_PER_AGENT by splitting across v1+v2.
  • Reaper coverage for v2: each reaper pass drives a registered v2 sweep (register_reaper_sweep, wired in the composition root so domain/ never imports api/) that drains abandoned runs through elapsed deadlines and swaps terminal backends for DB-backed ArchivedBacktestBackend tombstones — market-data buffers freed while status/result/decisions/idempotent-replay stay serviceable.
  • Restart visibility: startup orphan-recovery covers v2 rows for free, and the v2 API rehydrates terminal rows (owners see failed instead of a 404; the anti-oracle 404 shape for foreign probes is preserved and pinned).

Multi-worker recovery hardeningprotocol_runs gains owner_instance/heartbeat_at (migrated in place, ALTER race-tolerant); the reaper heartbeats every run whose session is live in this process; fail_stale_runs() (RUN_HEARTBEAT_STALE_SECONDS, default 300s) only fails runs nobody heartbeats — a sibling worker's live runs are safe, unlike the blanket startup UPDATE (which stays, for single-process).

sys.exit() deep fix (B0 class)AlpacaDataLoader/BaselineGenerator/HourlyBacktester.load_data raise MarketDataUnavailableError (plain exception) instead of exiting; the backtest CLIs translate it to exit-code 1 at the process boundary. The (Exception, SystemExit) guards remain as defense-in-depth.

SQLite WAL on the shared DB file (both connection layers, best-effort with fallback).

agent_runs.metadata JSON column — records the effective LLM_MAX_OUTPUT_TOKENS per LLM run (init + migrate paths, parsed consistently by every reader).

Green CI end to end — the 5 pre-existing failures were stale expectations (public /runs listing is a deliberate 2026-05-25 product decision; /runs/latest/metrics filters agent_name == 'Agent'; the safe prompt was rewritten). Tests re-pinned to the real contracts; the active prompt regained a compact no-tools constraint line (tightens, never loosens — enforcement stays response-side in validate_llm_response). Suite: 837 passed / 0 failed; packaging 41.

Landing — dead pages/not-found.tsx + ui/card.tsx deleted; shipped bundle rebuilt from the merged source (tsc + vite green, headless smoke: renders, auth modal, 0 console errors).

Docs/CI_owner_context session-credential caveat documented; CI DATABASE_PATH backstop; checklist updated with the full follow-up record.

Review

A 50-agent adversarial workflow (5 lens finders + 3-vote refute panels over 15 findings) reviewed the full diff; 12 confirmed findings were all resolved before this PR — including a critical cancel-clobber (a cleanup cancel racing the final decision downgraded a completed run's ledger row to closed, hiding its metrics). 30+ new red-green tests across test_run_lifecycle_unification.py, test_market_data_errors.py, test_agent_runs_metadata.py, test_sqlite_wal.py.

🤖 Generated with Claude Code

https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

FlyM1ss and others added 12 commits July 5, 2026 10:04
…ipped bundle

pages/not-found.tsx was fully dead (App.tsx renders its own inline 404) and
ui/card.tsx survived the LOW sweep only as its import — both deleted; tsc +
vite build verified green.

The committed frontend bundle was still the PRE-prune build: its CSS
(107.76 kB) predates the 65-file component deletion (3618b6a). Swapped in the
fresh assets (CSS 31.24 kB, JS byte-identical) and updated index.html's
references, dropping the now-redundant ?v= cache-busters (filenames are
content-hashed and vercel.json serves /assets/* immutable, / no-cache).
Headless smoke: root renders, auth modal present, no console errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…TH backstop in CI

B0/B1 follow-ups: _owner_context's X-Session-Id fallback makes session_id an
ownership credential for import-created agents — document that 'session_id is
never a credential' only holds for built-in agents. CI gets an explicit
DATABASE_PATH env so a conftest isolation regression can never write into the
committed seed DB.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
H4 follow-up. RunStore (protocol steps) and BacktestDatabase (equity/trades)
share one file; under the rollback journal a finalize's heavy writes block
every reader. WAL lets readers proceed during commits — the actual concurrency
shape here. Best-effort with fallback for filesystems without shm support;
journal_mode persists in the file so one switch at construction covers all
later connections from either layer (RunStore's own _enable_wal lands with the
lifecycle-unification commit that reworks that module). Verified on WSL drvfs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
… sweep, heartbeat recovery

Closes the three H4 reconciliation follow-ups in one mechanism: v2 runs now
write protocol_runs rows through their whole lifecycle (create=loading,
load-failure/finalize/cancel update the row), so

* the per-agent active-run cap counts run_store.count_active_runs() on BOTH
  surfaces under the SHARED v1 create lock — an agent can no longer hold
  2× MAX_ACTIVE_RUNS_PER_AGENT by splitting across v1+v2. The v2 create path
  first reconciles its own finished backends (passive is_active() peek only;
  live runs are never touched under the lock);
* the reaper reaches v2: reap_runs() drives sweeps registered via
  register_reaper_sweep() (composition root registers reap_v2_runs — domain
  never imports api/*). The sweep drains abandoned v2 runs through elapsed
  deadlines and swaps terminal backends for a DB-backed
  ArchivedBacktestBackend tombstone: market-data buffers freed, while
  status/result/decisions/idempotent-replay stay serviceable — and v2's
  _require_run rehydrates terminal rows post-restart (owners see 'failed'
  instead of a 404; anti-oracle 404 shape preserved for foreign probes);
* startup orphan recovery now covers v2 rows for free, and multi-worker
  deployments get heartbeat-based recovery: rows carry
  owner_instance/heartbeat_at (schema migrated in place), every reaper pass
  heartbeats runs whose session is live in this process, and
  fail_stale_runs() only fails runs nobody heartbeats
  (RUN_HEARTBEAT_STALE_SECONDS, default 300s; blanket
  RUN_RECOVERY_ON_STARTUP stays for single-process).

16 new tests (tests/test_run_lifecycle_unification.py) pin the cross-surface
cap both directions, sweep archival + deadline draining, restart rehydration,
oracle shape, heartbeat stamping/refresh, stale recovery, and the legacy
schema migration. FakeBackend gained an honest is_active();
test_active_run_count_is_side_effect_free re-pinned to the ledger-based
counting (passivity invariant unchanged).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…t() (B0 deep fix)

The class fix behind B0 and its recurring guard sites: AlpacaDataLoader
(missing creds / missing alpaca-py), BaselineGenerator (same) and
HourlyBacktester.load_data (no data) used to sys.exit(1). SystemExit is a
BaseException — it evaded every 'except Exception' server boundary, silently
killed daemon loader threads, and wedged the ASGI loop. They now raise
MarketDataUnavailableError (a plain RuntimeError subclass); the CLI
entrypoints (backtest_hourly_agent, backtest_custom_algo) translate it to
exit code 1 at the process boundary.

The existing (Exception, SystemExit) guards stay as defense-in-depth; their
comments now describe them as such. app.py's paper-baseline init, the legacy
/backtest/run background thread, and the algo-service subprocess path are
covered by their existing 'except Exception' handlers with no local changes.

Tests: tests/test_market_data_errors.py (red-green) pins raise-not-exit for
all three libraries, the plain-Exception contract, and the CLI boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…_TOKENS

LOW-sweep residual: the per-request output-token ceiling is an env knob that
changes a run's spend/behavior, but nothing recorded which value was in
force. agent_runs gains a JSON metadata column (added in BOTH _init_schema
and _migrate_schema, per the schema convention); insert_run accepts an
optional metadata dict; every agent_runs reader returns it parsed (one
_parse_run_row helper) so listings and get_run agree. The engine's LLM
agent run records {llm_max_output_tokens: <effective value>}; rule-based
runs record nothing. 6 red-green tests incl. legacy-schema migration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
The new CI ran RED on 5 failures that predate this PR (they fail on main
identically). All five pinned contracts the product deliberately no longer
has:

* GET /runs is a PUBLIC listing (route docstring; the dashboard home page
  fetches it session-less for the seed runs) — the strict per-session
  listing tests now pin the real contract: public metadata listing,
  session-scoped per-run data access, middleware 400 on malformed session
  ids for scoped routes (fixture also now patches the backtests router's
  import-time db binding, which the old tests silently missed).
* /runs/latest/metrics filters agent_name == 'Agent' (the internal hourly
  agent); the old fixture's 'Agent A'/'Agent B' rows could never match.
* create_safe_prompt's strategy body was rewritten; the old CANNOT-phrased
  constraint sentences only exist in a retired commented-out prompt. The
  active prompt regains a compact no-tools constraint line (parity with
  CUSTOM_STRATEGY_OUTPUT_CONTRACT — tightens, does not loosen) and the test
  pins the current wording; the hard boundary remains response-side
  validate_llm_response enforcement (unchanged, still pinned).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…ise contract

test_missing_credentials_exits pinned the retired sys.exit(1) behavior; it
now pins MarketDataUnavailableError (companion to the B0 deep-fix commit).
Full backend suite: 829 passed, 0 failed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…fication

A 50-agent review (5 lenses, 3-vote refute panels) of the follow-ups diff
confirmed 12 findings; the real defects, all fixed red-green:

* CRITICAL cancel-clobber: POST /v2/runs/{id}/cancel hardcoded status
  "closed" into the run's ledger row. A cleanup cancel landing after the
  final decision (before the sweep archives the backend) downgraded a
  genuinely completed run to "closed" — in the DB row AND the in-memory
  session — permanently hiding its metrics. cancel_run now no-ops on
  terminal runs and reports the run's true state; BacktestBackend.cancel()
  guards the terminal check under the session lock (closes the peek→cancel
  race with a finalizing submit).
* _archive_run no longer swaps in the tombstone when the terminal row write
  fails — the backend stays live so the next sweep retries, instead of the
  row freezing in an active status with nothing left to reconcile it.
* v2 rows now leave "loading" (→ "running") when the background market-data
  load succeeds.
* RunStore's heartbeat-column migration tolerates losing the probe→ALTER
  race to a concurrently-starting worker (duplicate-column errors are the
  goal state, not a crash).
* ArchivedBacktestBackend.from_record recovers step_index/total_steps for
  completed runs from the decision log (restart-rehydrated runs reported a
  misleading 0/0).
* test_active_run_count_is_side_effect_free regained its count assertion
  (ledger-based) alongside the passivity invariants.

Accepted + documented in code: cross-surface create serialization on the
shared lock, and deliberate v2<->v1 cross-surface run visibility.

Suites after fixes: backend 837 passed / 0 failed; packaging 41 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
Upstream merged PR #67 plus two landing refreshes (#69/#70) while this
follow-up pass was in flight. Conflicts were confined to the shipped
frontend bundle: upstream's refresh was built before this branch's dead
component prune, and this branch's refresh predated upstream's brand
changes (Hero/Navbar/index.css, atltransparent logo, Inter 800). Resolved
by REBUILDING the bundle from the merged landing source — brand changes +
prune together (tsc green, vite build green, index-CoQ1VbAs.css 31.71 kB,
headless smoke: renders, auth modal, 0 console errors) — and pointing
index.html at the rebuilt hashed assets (content hashes + the vercel.json
immutable/no-cache split make the ?v= params redundant).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
… reads, bounded registry

Seven findings from the PR #71 review, each red-green pinned in
tests/test_pr71_review_fixes.py:

1. cancel-during-loading race — load_market_data published waiting_decision
   WITHOUT _step_lock, defeating cancel()'s lock guard; the background _load
   also wrote failed/running unconditionally. A cancel landing during the load
   was silently reverted (run kept running, cap slot freed, client misled).
   Both load paths now respect an already-terminal (cancelled) status under the
   lock.

2. archived status() dropped baseline_run_ids/compare_url — the live
   get_status() returns them for a completed run but the tombstone omitted
   them, so the same endpoint changed shape once the reaper archived the run.
   ArchivedBacktestBackend.status() now rebuilds both from the persisted
   baseline columns (shared build_compare_url helper).

3. rehydrated failed/closed runs reported 0/0 steps — from_record only
   recovered counts for completed runs (from the decision log). protocol_runs
   now persists step_index/total_steps at archive time; from_record reads them
   (decision-log fallback kept for legacy rows).

4. cancel hung during finalize — cancel() blocked on _step_lock held through
   _finalize's baseline generation (seconds-to-minutes). _finalize now
   publishes "completed" before the best-effort baseline block, and cancel()
   short-circuits on a lock-free terminal read.

5. v2 _runs never pruned — the reaper now evicts prior-pass archived tombstones
   (a later read rehydrates from the terminal row), bounding _runs by ACTIVE
   runs instead of total historical runs.

6/7. dedupe — extracted a shared enable_wal(db_path) helper (was duplicated
   verbatim across both DB layers) and a single TERMINAL_STATUSES constant in
   execution/base.py (was defined three times).

Suite: 843 passed / 0 failed; packaging 41.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KoZVnrSPdYNyrxSfrUwaok
…eanup

Quality-only cleanups (no behavior change; suite 843/0). From a 4-lens
review of the PR diff:

- Extract shared build_final_metrics(run) in external_run_service (mirrors
  build_compare_url); the 9-key completed-run metrics dict was duplicated
  verbatim in ArchivedBacktestBackend.status().
- BacktestBackend.advance() calls session.drain_expired() instead of
  get_current_step(), which rebuilt + discarded a full market snapshot
  under _step_lock on every reaper pass, per live run.
- _archive_run derived step_index/total_steps twice with divergent
  is-None vs or fallbacks (disagreeing on a real 0) — collapse to one helper.
- Drop the dead `not isinstance(backend, ArchivedBacktestBackend)` guard in
  cancel_run (active is read from is_active(), which a tombstone always
  reports False, so active already implies a live backend).
- reap_runs and the _v2_fakes test double use the TERMINAL_STATUSES constant
  instead of hardcoded status tuples (the constant's own anti-drift purpose).
- get_runs_by_mode routes through _parse_run_row like the other five
  agent_runs readers (no consumer depends on the raw-string metadata shape).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KoZVnrSPdYNyrxSfrUwaok
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant