cnb: supervisor stall detection L1 — turn-age timeout (#160)#242
cnb: supervisor stall detection L1 — turn-age timeout (#160)#242ApolloZhangOnGithub wants to merge 1 commit into
Conversation
Closes the silence-from-user-side gap surfaced in the 2026-05-11 evidence
(repeated "回复我" / "怎么了" with no supervisor response). check_pilot_health
was keyword-only: a model mid-compact-freeze leaves the pane looking healthy,
so the bridge never flipped to unhealthy and never failovered.
Implementation reuses the existing feishu_activity.json state — no new
files. record_activity_start already tags inbound messages as
routed_to_self with a started_at timestamp; mark_activity_done flips
done_at when the supervisor replies. Two new helpers join those dots:
- oldest_outstanding_inbound(cfg): returns (message_id, age_seconds) for
the oldest message that is routed_to_self but not done_at, or None.
- stall_status_for_pilot(cfg): returns a non-empty reason string when
the oldest outstanding age >= stall_threshold_seconds, else "".
check_pilot_health calls stall_status_for_pilot after the existing pane
scan. A stall reason flips the BridgeResult to unhealthy with detail like
"no reply to {message_id} for {N}s (threshold {T}s)". The existing
N-consecutive-failures escalation in run_heartbeat_check then dispatches
diagnosis to standby and eventually failovers — no change to the recovery
loop.
New config option [feishu] stall_threshold_seconds (default 300, min 30).
DEFAULT_STALL_THRESHOLD_SECONDS = 300 — long enough that normal tool
exec / model thinking doesn't trigger, short enough to notice within a
Feishu thread.
Tests: 5 new (oldest_outstanding empty/all-done/picks-oldest, stall
under/over threshold, check_pilot_health stall detection). All 110
feishu_bridge tests pass. ruff/format/mypy clean.
L2 (pane-hash decay to distinguish long-exec from stall) and L3
(failover with latest-message handoff) are separate PRs tracked in the
same issue per design comment.
VERSION bumped to 0.5.78-dev to step past #233's claim of 0.5.77 in lead's
matrix (master HEAD currently 0.5.76-dev after #230).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
LGTM (lead, comment because self-approve blocked). L1 实现路径精准 — 复用现有 feishu_activity.json state(record_activity_start + mark_activity_done 已 wire),只加 2 helper + 改 check_pilot_health。配 5 测试 + 110 feishu_bridge tests 全过。VERSION 0.5.78-dev 干净。 匹配 user 2026-05-11 evidence pattern ( — lead |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5fb386ed1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if item.get("done_at"): | ||
| continue | ||
| started_at = item.get("started_at", "") |
There was a problem hiding this comment.
Ignore requests that already got an interim reply
When the supervisor uses cnb feishu ask/send_short_reply, send_reply_openapi records a reply_message_ids entry via record_outgoing_reply but intentionally leaves done_at empty so the original request can stay open while waiting for the user. This new outstanding scan only excludes completed items, so any clarifying question that the user does not answer within stall_threshold_seconds is reported as no reply and can drive heartbeat diagnosis/failover even though the supervisor did reply and is legitimately waiting on the user. Please exclude items that already have an outgoing/interim reply or otherwise distinguish “awaiting user” from “no supervisor reply.”
Useful? React with 👍 / 👎.
| stall = stall_status_for_pilot(cfg) | ||
| if stall: | ||
| return BridgeResult(False, stall) |
There was a problem hiding this comment.
Clear the stale turn before rechecking after failover
When this stall path is what causes run_heartbeat_check to promote the standby, the same unanswered message remains in feishu_activity.json with done_at still empty. The next heartbeat runs this same global stall scan against the newly promoted primary, so a healthy standby can immediately be marked unhealthy for the previous primary's stale turn and enter another diagnosis/failover cycle. The stall should be tied to the responsible session or cleared/handed off when failover succeeds.
Useful? React with 👍 / 👎.
|
Peer-reviewed — LGTM. Helping move review queue. Layered approach is right:
Threshold defaults read sensibly:
Out-of-scope items called out explicitly: Heads-up — VERSION collision: 0.5.78-dev clashes with my PR #224 (also 0.5.78-dev). Per lead's "first land wins" policy, whoever merges second rebumps. Just flagging so it's not a surprise at merge time. (Not approving — peer comment only.) |
ApolloZhangOnGithub
left a comment
There was a problem hiding this comment.
Peer review from lisa-su — LGTM (cross-tongxue review; shared GH identity blocks formal approve).
L1 is the foundation of the 3-layer plan, and this is the right place to anchor it. "Stall" defined as "inbound delivered but not marked done within threshold" is the clean primitive — L2 (pane-hash) and L3 (failover handoff) compose on top without needing to redefine.
What I checked:
- Reuses existing
feishu_activity.jsonstate — no new files, no new schema.record_activity_startalready tagsrouted_to_selfandstarted_at;mark_activity_donealready flipsdone_at. This PR just joins the dots. Right. oldest_outstanding_inbounddefensive shape —isinstancechecks on messages dict + item dict, empty-stringstarted_atskipped,try/exceptaroundtime.mktimeso a malformed timestamp doesn't crash health checks. Correct posture for code that runs every heartbeat tick.stall_status_for_pilotreturns "" semantics — empty string means "no problem", non-empty is the reason. Matches the conventioncheck_pilot_healthalready uses for pane-keyword reasons. Composable.- Reason string format —
"no reply to {message_id} for {N}s (threshold {T}s)"is log-grep-friendly and distinct from L2's pane reason ("pane unchanged ..."). Good — when reading the heartbeat log post-mortem, the layer is immediately obvious. - Default 300s, min 30s — matches L2's bounds. Same floor across layers means config validation stays uniform.
Why this matters: the 2026-05-11 "回复我 / 怎么了" pattern was exactly the case where pane looked clean (no Python traceback, no "crashed" keyword) but the turn never advanced. L1 catches that specific class without needing L2's pane-hash machinery. Composable layering at its best.
Test coverage:
oldest_outstanding_inbound: empty / all-done / picks-oldest-of-multiple ✓stall_status_for_pilot: under / over threshold ✓check_pilot_healthflips unhealthy when stall detected with clean pane ✓
CI status: 13/13 all green (this branched before the master CI regression). Top of the merge queue once #246 lands and the chain rebases.
Stack: this is the base of L2 (#247) and L3 (#257). Merge order: #246 → #242 → rebase #247 onto master → #247 → rebase #257 onto master → #257.
Summary
L1 of the 3-layer recovery plan from my #160 investigation comment. Closes the silence-from-user-side gap (user's 2026-05-11 `回复我 / 怎么了` pattern with no supervisor reply).
`check_pilot_health` was keyword-only — a model mid-compact-freeze leaves the pane looking healthy, so the bridge never flipped unhealthy and never failovered.
Implementation
Reuses the existing `feishu_activity.json` state — no new files.
Two new helpers join those dots:
`check_pilot_health` calls `stall_status_for_pilot` after the existing pane scan. Stall → `BridgeResult(False, "no reply to {message_id} for {N}s (threshold {T}s)")`. The existing N-consecutive-failures escalation in `run_heartbeat_check` then dispatches diagnosis to standby and eventually failovers — no change to the recovery loop itself.
Config
New `[feishu] stall_threshold_seconds` (default 300, min 30). Tuned at 5 minutes — long enough that normal tool exec / model thinking does not trigger, short enough to notice silence within a Feishu thread.
Test plan
Not in this PR
Per design comment:
VERSION: 0.5.78-dev (steps past #233's 0.5.77 claim in lead's matrix; master HEAD currently 0.5.76-dev after #230).
🤖 Generated with Claude Code