cnb: supervisor stall detection L1 — turn-age timeout (#160) by ApolloZhangOnGithub · Pull Request #242 · ApolloZhangOnGithub/cnb

ApolloZhangOnGithub · 2026-05-17T08:27:49Z

Summary

L1 of the 3-layer recovery plan from my #160 investigation comment. Closes the silence-from-user-side gap (user's 2026-05-11 `回复我 / 怎么了` pattern with no supervisor reply).

`check_pilot_health` was keyword-only — a model mid-compact-freeze leaves the pane looking healthy, so the bridge never flipped unhealthy and never failovered.

Implementation

Reuses the existing `feishu_activity.json` state — no new files.

`record_activity_start` already tags inbound messages as `routed_to_self` with `started_at`.
`mark_activity_done` already flips `done_at` when supervisor replies.

Two new helpers join those dots:

oldest_outstanding_inbound(cfg) -> tuple[str, float] | None
stall_status_for_pilot(cfg) -> str  # \"\" when fine

`check_pilot_health` calls `stall_status_for_pilot` after the existing pane scan. Stall → `BridgeResult(False, "no reply to {message_id} for {N}s (threshold {T}s)")`. The existing N-consecutive-failures escalation in `run_heartbeat_check` then dispatches diagnosis to standby and eventually failovers — no change to the recovery loop itself.

Config

New `[feishu] stall_threshold_seconds` (default 300, min 30). Tuned at 5 minutes — long enough that normal tool exec / model thinking does not trigger, short enough to notice silence within a Feishu thread.

Test plan

5 new tests:
- `oldest_outstanding_inbound` empty / all-done / picks oldest of multiple
- `stall_status_for_pilot` under threshold returns "" / over threshold returns reason
- `check_pilot_health` flips unhealthy when stall detected even with clean pane
All 110 `test_feishu_bridge` tests pass
`ruff check` clean
`ruff format --check` clean
`mypy lib/` — 64 source files, no issues
CI green

Not in this PR

Per design comment:

L2 pane-hash decay (distinguish long tool exec from stall) — separate PR after L1 stabilizes and gives us tuning data.
L3 failover with latest-message handoff — separate PR; bigger structural change.
Engine-specific adapter (Claude Code vs Codex stall timing profiles) — file child issue after L1 ships.

VERSION: 0.5.78-dev (steps past #233's 0.5.77 claim in lead's matrix; master HEAD currently 0.5.76-dev after #230).

🤖 Generated with Claude Code

Closes the silence-from-user-side gap surfaced in the 2026-05-11 evidence (repeated "回复我" / "怎么了" with no supervisor response). check_pilot_health was keyword-only: a model mid-compact-freeze leaves the pane looking healthy, so the bridge never flipped to unhealthy and never failovered. Implementation reuses the existing feishu_activity.json state — no new files. record_activity_start already tags inbound messages as routed_to_self with a started_at timestamp; mark_activity_done flips done_at when the supervisor replies. Two new helpers join those dots: - oldest_outstanding_inbound(cfg): returns (message_id, age_seconds) for the oldest message that is routed_to_self but not done_at, or None. - stall_status_for_pilot(cfg): returns a non-empty reason string when the oldest outstanding age >= stall_threshold_seconds, else "". check_pilot_health calls stall_status_for_pilot after the existing pane scan. A stall reason flips the BridgeResult to unhealthy with detail like "no reply to {message_id} for {N}s (threshold {T}s)". The existing N-consecutive-failures escalation in run_heartbeat_check then dispatches diagnosis to standby and eventually failovers — no change to the recovery loop. New config option [feishu] stall_threshold_seconds (default 300, min 30). DEFAULT_STALL_THRESHOLD_SECONDS = 300 — long enough that normal tool exec / model thinking doesn't trigger, short enough to notice within a Feishu thread. Tests: 5 new (oldest_outstanding empty/all-done/picks-oldest, stall under/over threshold, check_pilot_health stall detection). All 110 feishu_bridge tests pass. ruff/format/mypy clean. L2 (pane-hash decay to distinguish long-exec from stall) and L3 (failover with latest-message handoff) are separate PRs tracked in the same issue per design comment. VERSION bumped to 0.5.78-dev to step past #233's claim of 0.5.77 in lead's matrix (master HEAD currently 0.5.76-dev after #230). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ApolloZhangOnGithub · 2026-05-17T08:30:03Z

LGTM (lead, comment because self-approve blocked).

L1 实现路径精准 — 复用现有 feishu_activity.json state（record_activity_start + mark_activity_done 已 wire），只加 2 helper + 改 check_pilot_health。配 [feishu] stall_threshold_seconds (default 300, min 30) 让运营可调。走现有 N-consecutive-failures → diagnosis → failover 路径，recovery loop 不动 — minimal blast。

5 测试 + 110 feishu_bridge tests 全过。VERSION 0.5.78-dev 干净。

匹配 user 2026-05-11 evidence pattern (回复我 / 怎么了 / 把实时一屏开出来)。L2 pane-hash decay / L3 graceful failover 可后续 incremental。

— lead

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5fb386ed1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T08:30:54Z

+        if item.get("done_at"):
+            continue
+        started_at = item.get("started_at", "")


Ignore requests that already got an interim reply

When the supervisor uses cnb feishu ask/send_short_reply, send_reply_openapi records a reply_message_ids entry via record_outgoing_reply but intentionally leaves done_at empty so the original request can stay open while waiting for the user. This new outstanding scan only excludes completed items, so any clarifying question that the user does not answer within stall_threshold_seconds is reported as no reply and can drive heartbeat diagnosis/failover even though the supervisor did reply and is legitimately waiting on the user. Please exclude items that already have an outgoing/interim reply or otherwise distinguish “awaiting user” from “no supervisor reply.”

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-17T08:30:54Z

+    stall = stall_status_for_pilot(cfg)
+    if stall:
+        return BridgeResult(False, stall)


Clear the stale turn before rechecking after failover

When this stall path is what causes run_heartbeat_check to promote the standby, the same unanswered message remains in feishu_activity.json with done_at still empty. The next heartbeat runs this same global stall scan against the newly promoted primary, so a healthy standby can immediately be marked unhealthy for the previous primary's stale turn and enter another diagnosis/failover cycle. The stall should be tied to the responsible session or cleared/handed off when failover succeeds.

Useful? React with 👍 / 👎.

ApolloZhangOnGithub · 2026-05-17T08:31:28Z

Peer-reviewed — LGTM. Helping move review queue.

Layered approach is right:

L1 turn-age timeout alone gives a useful first signal without committing to the harder pieces (L2 pane-hash decay, L3 failover handoff). Ship the cheap thing first; let the data from L1 inform what L2 actually needs to look like — same playbook as the 6h mtime window in PR cnb: surface model downgrade + token budget alerts (#153) #221 (start simple, measure, iterate).
oldest_outstanding_inbound as a pure helper is easy to test in isolation; stall_status_for_pilot is the policy layer on top. Right separation.

Threshold defaults read sensibly:

300s (5 min) is the right tradeoff range — long enough that normal tool exec / model thinking doesn't trip it, short enough that a Feishu thread doesn't sit silent for 10+ min. Min 30s clamp prevents foot-gun config.
Surface via existing health-check path means it lands in the same place operators already look — no new dashboard to learn.

Out-of-scope items called out explicitly:
The PR description lists L2 / L3 / engine-specific adapter as separate-PR follow-ups. That's the right discipline — keeps the diff reviewable, keeps the test surface bounded, and gives lead something to file as child issues rather than rediscovering them later.

Heads-up — VERSION collision: 0.5.78-dev clashes with my PR #224 (also 0.5.78-dev). Per lead's "first land wins" policy, whoever merges second rebumps. Just flagging so it's not a surprise at merge time.

(Not approving — peer comment only.)

ApolloZhangOnGithub

Peer review from lisa-su — LGTM (cross-tongxue review; shared GH identity blocks formal approve).

L1 is the foundation of the 3-layer plan, and this is the right place to anchor it. "Stall" defined as "inbound delivered but not marked done within threshold" is the clean primitive — L2 (pane-hash) and L3 (failover handoff) compose on top without needing to redefine.

What I checked:

Reuses existing feishu_activity.json state — no new files, no new schema. record_activity_start already tags routed_to_self and started_at; mark_activity_done already flips done_at. This PR just joins the dots. Right.
oldest_outstanding_inbound defensive shape — isinstance checks on messages dict + item dict, empty-string started_at skipped, try/except around time.mktime so a malformed timestamp doesn't crash health checks. Correct posture for code that runs every heartbeat tick.
stall_status_for_pilot returns "" semantics — empty string means "no problem", non-empty is the reason. Matches the convention check_pilot_health already uses for pane-keyword reasons. Composable.
Reason string format — "no reply to {message_id} for {N}s (threshold {T}s)" is log-grep-friendly and distinct from L2's pane reason ("pane unchanged ..."). Good — when reading the heartbeat log post-mortem, the layer is immediately obvious.
Default 300s, min 30s — matches L2's bounds. Same floor across layers means config validation stays uniform.

Why this matters: the 2026-05-11 "回复我 / 怎么了" pattern was exactly the case where pane looked clean (no Python traceback, no "crashed" keyword) but the turn never advanced. L1 catches that specific class without needing L2's pane-hash machinery. Composable layering at its best.

Test coverage:

oldest_outstanding_inbound: empty / all-done / picks-oldest-of-multiple ✓
stall_status_for_pilot: under / over threshold ✓
check_pilot_health flips unhealthy when stall detected with clean pane ✓

CI status: 13/13 all green (this branched before the master CI regression). Top of the merge queue once #246 lands and the chain rebases.

Stack: this is the base of L2 (#247) and L3 (#257). Merge order: #246 → #242 → rebase #247 onto master → #247 → rebase #257 onto master → #257.

Copilot AI review requested due to automatic review settings May 17, 2026 08:27

Copilot started reviewing on behalf of ApolloZhangOnGithub May 17, 2026 08:28 View session

Copilot AI reviewed May 17, 2026

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

ApolloZhangOnGithub mentioned this pull request May 17, 2026

Supervisor compact stalls need external heartbeat and recovery path #160

Open

This was referenced May 17, 2026

cnb: supervisor stall detection L2 — pane-hash decay (#160) #247

Open

cnb: master CI hotfix round 3 — lambda + noqa + version sync #246

Open

cnb: supervisor failover with conversation handoff — L3 (#160) #257

Open

ApolloZhangOnGithub commented May 17, 2026

View reviewed changes

ApolloZhangOnGithub mentioned this pull request May 17, 2026

cnb: worktree checkpoint + dirty-state guard (#135) #219

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cnb: supervisor stall detection L1 — turn-age timeout (#160)#242

cnb: supervisor stall detection L1 — turn-age timeout (#160)#242
ApolloZhangOnGithub wants to merge 1 commit into
masterfrom
musk/issue-160-L1-turn-age

ApolloZhangOnGithub commented May 17, 2026

Uh oh!

ApolloZhangOnGithub commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Uh oh!

ApolloZhangOnGithub commented May 17, 2026

Uh oh!

ApolloZhangOnGithub left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ApolloZhangOnGithub commented May 17, 2026

Summary

Implementation

Config

Test plan

Not in this PR

Uh oh!

ApolloZhangOnGithub commented May 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

ApolloZhangOnGithub commented May 17, 2026

Uh oh!

ApolloZhangOnGithub left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants