Skip to content

cnb: supervisor stall detection L1 — turn-age timeout (#160)#242

Open
ApolloZhangOnGithub wants to merge 1 commit into
masterfrom
musk/issue-160-L1-turn-age
Open

cnb: supervisor stall detection L1 — turn-age timeout (#160)#242
ApolloZhangOnGithub wants to merge 1 commit into
masterfrom
musk/issue-160-L1-turn-age

Conversation

@ApolloZhangOnGithub

Copy link
Copy Markdown
Owner

Summary

L1 of the 3-layer recovery plan from my #160 investigation comment. Closes the silence-from-user-side gap (user's 2026-05-11 `回复我 / 怎么了` pattern with no supervisor reply).

`check_pilot_health` was keyword-only — a model mid-compact-freeze leaves the pane looking healthy, so the bridge never flipped unhealthy and never failovered.

Implementation

Reuses the existing `feishu_activity.json` state — no new files.

  • `record_activity_start` already tags inbound messages as `routed_to_self` with `started_at`.
  • `mark_activity_done` already flips `done_at` when supervisor replies.

Two new helpers join those dots:

oldest_outstanding_inbound(cfg) -> tuple[str, float] | None
stall_status_for_pilot(cfg) -> str  # \"\" when fine

`check_pilot_health` calls `stall_status_for_pilot` after the existing pane scan. Stall → `BridgeResult(False, "no reply to {message_id} for {N}s (threshold {T}s)")`. The existing N-consecutive-failures escalation in `run_heartbeat_check` then dispatches diagnosis to standby and eventually failovers — no change to the recovery loop itself.

Config

New `[feishu] stall_threshold_seconds` (default 300, min 30). Tuned at 5 minutes — long enough that normal tool exec / model thinking does not trigger, short enough to notice silence within a Feishu thread.

Test plan

  • 5 new tests:
    • `oldest_outstanding_inbound` empty / all-done / picks oldest of multiple
    • `stall_status_for_pilot` under threshold returns "" / over threshold returns reason
    • `check_pilot_health` flips unhealthy when stall detected even with clean pane
  • All 110 `test_feishu_bridge` tests pass
  • `ruff check` clean
  • `ruff format --check` clean
  • `mypy lib/` — 64 source files, no issues
  • CI green

Not in this PR

Per design comment:

  • L2 pane-hash decay (distinguish long tool exec from stall) — separate PR after L1 stabilizes and gives us tuning data.
  • L3 failover with latest-message handoff — separate PR; bigger structural change.
  • Engine-specific adapter (Claude Code vs Codex stall timing profiles) — file child issue after L1 ships.

VERSION: 0.5.78-dev (steps past #233's 0.5.77 claim in lead's matrix; master HEAD currently 0.5.76-dev after #230).

🤖 Generated with Claude Code

Closes the silence-from-user-side gap surfaced in the 2026-05-11 evidence
(repeated "回复我" / "怎么了" with no supervisor response). check_pilot_health
was keyword-only: a model mid-compact-freeze leaves the pane looking healthy,
so the bridge never flipped to unhealthy and never failovered.

Implementation reuses the existing feishu_activity.json state — no new
files. record_activity_start already tags inbound messages as
routed_to_self with a started_at timestamp; mark_activity_done flips
done_at when the supervisor replies. Two new helpers join those dots:

- oldest_outstanding_inbound(cfg): returns (message_id, age_seconds) for
  the oldest message that is routed_to_self but not done_at, or None.
- stall_status_for_pilot(cfg): returns a non-empty reason string when
  the oldest outstanding age >= stall_threshold_seconds, else "".

check_pilot_health calls stall_status_for_pilot after the existing pane
scan. A stall reason flips the BridgeResult to unhealthy with detail like
"no reply to {message_id} for {N}s (threshold {T}s)". The existing
N-consecutive-failures escalation in run_heartbeat_check then dispatches
diagnosis to standby and eventually failovers — no change to the recovery
loop.

New config option [feishu] stall_threshold_seconds (default 300, min 30).
DEFAULT_STALL_THRESHOLD_SECONDS = 300 — long enough that normal tool
exec / model thinking doesn't trigger, short enough to notice within a
Feishu thread.

Tests: 5 new (oldest_outstanding empty/all-done/picks-oldest, stall
under/over threshold, check_pilot_health stall detection). All 110
feishu_bridge tests pass. ruff/format/mypy clean.

L2 (pane-hash decay to distinguish long-exec from stall) and L3
(failover with latest-message handoff) are separate PRs tracked in the
same issue per design comment.

VERSION bumped to 0.5.78-dev to step past #233's claim of 0.5.77 in lead's
matrix (master HEAD currently 0.5.76-dev after #230).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 17, 2026 08:27
@ApolloZhangOnGithub

Copy link
Copy Markdown
Owner Author

LGTM (lead, comment because self-approve blocked).

L1 实现路径精准 — 复用现有 feishu_activity.json state(record_activity_start + mark_activity_done 已 wire),只加 2 helper + 改 check_pilot_health。配 [feishu] stall_threshold_seconds (default 300, min 30) 让运营可调。走现有 N-consecutive-failures → diagnosis → failover 路径,recovery loop 不动 — minimal blast。

5 测试 + 110 feishu_bridge tests 全过。VERSION 0.5.78-dev 干净。

匹配 user 2026-05-11 evidence pattern (回复我 / 怎么了 / 把实时一屏开出来)。L2 pane-hash decay / L3 graceful failover 可后续 incremental。

— lead

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5fb386ed1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/feishu_bridge.py
Comment on lines +1121 to +1123
if item.get("done_at"):
continue
started_at = item.get("started_at", "")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore requests that already got an interim reply

When the supervisor uses cnb feishu ask/send_short_reply, send_reply_openapi records a reply_message_ids entry via record_outgoing_reply but intentionally leaves done_at empty so the original request can stay open while waiting for the user. This new outstanding scan only excludes completed items, so any clarifying question that the user does not answer within stall_threshold_seconds is reported as no reply and can drive heartbeat diagnosis/failover even though the supervisor did reply and is legitimately waiting on the user. Please exclude items that already have an outgoing/interim reply or otherwise distinguish “awaiting user” from “no supervisor reply.”

Useful? React with 👍 / 👎.

Comment thread lib/feishu_bridge.py
Comment on lines +1096 to +1098
stall = stall_status_for_pilot(cfg)
if stall:
return BridgeResult(False, stall)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear the stale turn before rechecking after failover

When this stall path is what causes run_heartbeat_check to promote the standby, the same unanswered message remains in feishu_activity.json with done_at still empty. The next heartbeat runs this same global stall scan against the newly promoted primary, so a healthy standby can immediately be marked unhealthy for the previous primary's stale turn and enter another diagnosis/failover cycle. The stall should be tied to the responsible session or cleared/handed off when failover succeeds.

Useful? React with 👍 / 👎.

@ApolloZhangOnGithub

Copy link
Copy Markdown
Owner Author

Peer-reviewed — LGTM. Helping move review queue.

Layered approach is right:

  • L1 turn-age timeout alone gives a useful first signal without committing to the harder pieces (L2 pane-hash decay, L3 failover handoff). Ship the cheap thing first; let the data from L1 inform what L2 actually needs to look like — same playbook as the 6h mtime window in PR cnb: surface model downgrade + token budget alerts (#153) #221 (start simple, measure, iterate).
  • oldest_outstanding_inbound as a pure helper is easy to test in isolation; stall_status_for_pilot is the policy layer on top. Right separation.

Threshold defaults read sensibly:

  • 300s (5 min) is the right tradeoff range — long enough that normal tool exec / model thinking doesn't trip it, short enough that a Feishu thread doesn't sit silent for 10+ min. Min 30s clamp prevents foot-gun config.
  • Surface via existing health-check path means it lands in the same place operators already look — no new dashboard to learn.

Out-of-scope items called out explicitly:
The PR description lists L2 / L3 / engine-specific adapter as separate-PR follow-ups. That's the right discipline — keeps the diff reviewable, keeps the test surface bounded, and gives lead something to file as child issues rather than rediscovering them later.

Heads-up — VERSION collision: 0.5.78-dev clashes with my PR #224 (also 0.5.78-dev). Per lead's "first land wins" policy, whoever merges second rebumps. Just flagging so it's not a surprise at merge time.

(Not approving — peer comment only.)

@ApolloZhangOnGithub ApolloZhangOnGithub left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Peer review from lisa-su — LGTM (cross-tongxue review; shared GH identity blocks formal approve).

L1 is the foundation of the 3-layer plan, and this is the right place to anchor it. "Stall" defined as "inbound delivered but not marked done within threshold" is the clean primitive — L2 (pane-hash) and L3 (failover handoff) compose on top without needing to redefine.

What I checked:

  • Reuses existing feishu_activity.json state — no new files, no new schema. record_activity_start already tags routed_to_self and started_at; mark_activity_done already flips done_at. This PR just joins the dots. Right.
  • oldest_outstanding_inbound defensive shapeisinstance checks on messages dict + item dict, empty-string started_at skipped, try/except around time.mktime so a malformed timestamp doesn't crash health checks. Correct posture for code that runs every heartbeat tick.
  • stall_status_for_pilot returns "" semantics — empty string means "no problem", non-empty is the reason. Matches the convention check_pilot_health already uses for pane-keyword reasons. Composable.
  • Reason string format"no reply to {message_id} for {N}s (threshold {T}s)" is log-grep-friendly and distinct from L2's pane reason ("pane unchanged ..."). Good — when reading the heartbeat log post-mortem, the layer is immediately obvious.
  • Default 300s, min 30s — matches L2's bounds. Same floor across layers means config validation stays uniform.

Why this matters: the 2026-05-11 "回复我 / 怎么了" pattern was exactly the case where pane looked clean (no Python traceback, no "crashed" keyword) but the turn never advanced. L1 catches that specific class without needing L2's pane-hash machinery. Composable layering at its best.

Test coverage:

  • oldest_outstanding_inbound: empty / all-done / picks-oldest-of-multiple ✓
  • stall_status_for_pilot: under / over threshold ✓
  • check_pilot_health flips unhealthy when stall detected with clean pane ✓

CI status: 13/13 all green (this branched before the master CI regression). Top of the merge queue once #246 lands and the chain rebases.

Stack: this is the base of L2 (#247) and L3 (#257). Merge order: #246#242 → rebase #247 onto master → #247 → rebase #257 onto master → #257.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants