From b930115b695cac074a54e148a63f0d6e3b1faba4 Mon Sep 17 00:00:00 2001
From: ysyneu
Date: Fri, 26 Jun 2026 21:44:40 +0800
Subject: [PATCH] perf(skill): slim incident-summary output and serialize same-host monit-agent probes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P6 - incident-summary.sh dumped ~80K chars of toon per incident
(81042/79075 in the audited sessions), most of it empty boilerplate,
overflowing the inline output cap and forcing 3-4 sequential reads to
page it back in.

Root cause: the script appended `--output-format toon` to every command,
which takes each verb's machine-readable branch and marshals the full
raw response object (every empty field on `incident detail`, plus heavy
blobs like a change's `labels.steps`). The DEFAULT (table/summary)
renderer of each of these read verbs is a curated projection of exactly
the summary-relevant fields
(id/severity/status/title/channel/timestamps/ai_summary/root_cause/…).

Fix: drop `--output-format toon` from run() so each command uses its
lean default renderer; that default IS the field projection a fault
summary needs. Kept all six commands, set -uo pipefail, and the
read-only "print real output" intent.

P8 - monit-agent.md: parallelizing multiple probes against the SAME host
hit the per-target concurrency limit and returned `code=overloaded`,
forcing an identical retry that re-sent the growing context. Added a
Gotchas line steering fan-out to serialize probes per target (batch a
host's tools into one `invoke`) and parallelize only across distinct
targets.

Evidence: audit run audit-2026-06-26, sessions
sess_TRdrnq5oA345qF66GgbYrY (P6) and sess_QmoruPkRjrwA2tcHvEbNvT (P8).
Surfaced by /audit-ai-sre-sessions.

These two files are kept byte-identical with their embedded copies in
fc-safari (logic/runtime/bootstrap/skills/flashduty/); mirrored there in
a linked PR.
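
The P8 rule (one batched call per target, fan-out only across distinct targets) can be sketched in shell roughly as follows. This is an illustration under stated assumptions, not the skill's implementation: `invoke_batch` is a hypothetical stand-in that only echoes what would be sent, and the "target tool" line format for the input is assumed for the example.

```shell
#!/usr/bin/env bash
# Sketch: serialize per target by batching, parallelize only across targets.

# Hypothetical stand-in for one batched monit-agent call; it only echoes
# what would be sent and does not contact any agent.
invoke_batch() {
  local target="$1"
  shift
  echo "invoke target=${target} tools=$*"
}

# stdin:  one "target tool" pair per line (assumed input format)
# stdout: one "target tool1 tool2 ..." line per distinct target
group_by_target() {
  sort -s -k1,1 | awk '
    NF < 2 { next }                      # skip blank or malformed lines
    { key = $1 "" }                      # force string comparison
    n == 0 || key != prev {
      if (n > 0) printf "\n"
      printf "%s", key
      prev = key
      n++
    }
    { printf " %s", $2 }
    END { if (n > 0) printf "\n" }'
}

# One background job per distinct target, so no target ever receives two
# concurrent calls; `wait` blocks until every batch has finished.
run_all() {
  group_by_target | {
    while read -r target tools; do
      # $tools is deliberately unquoted so it splits into separate arguments.
      invoke_batch "$target" $tools &
    done
    wait
  }
}
```

Feeding it `printf 'web-1 top\ndb-1 disk\nweb-1 disk\n' | run_all` yields one `invoke` line per target (`web-1` with `top disk`, `db-1` with `disk`); the order of the two lines may vary because the targets run in parallel.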
---
 skills/flashduty/reference/monit-agent.md    |  1 +
 skills/flashduty/scripts/incident-summary.sh | 14 ++++++++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/skills/flashduty/reference/monit-agent.md b/skills/flashduty/reference/monit-agent.md
index 5964907..f7572ef 100644
--- a/skills/flashduty/reference/monit-agent.md
+++ b/skills/flashduty/reference/monit-agent.md
@@ -54,6 +54,7 @@ Run up to 8 monit-agent tools concurrently on a target
 - **`ambiguous_target_kind` error** ⇒ the locator matched multiple kinds; re-issue with `--target-kind`.
 - A `target_unavailable` / `target_unreachable` error means the agent isn't connected — report it; don't retry endlessly or fall back to SSH.
 - Per-tool errors (`timeout`, `denied`, `unknown_tool`…) are reported per result, mutually exclusive with that tool's `data`.
+- **Serialize per target; parallelize only across targets.** Each target enforces a per-target concurrency limit, so two `invoke`/`catalog` calls fired at the *same* locator at once make the second come back `code=overloaded` — forcing a context-bloating retry. Batch every tool for one host into a single `invoke` (its `tools` array already runs them concurrently agent-side); fan out in parallel across *distinct* targets, never against one.
 
 ## Worked example — top processes + disk on a host
 
diff --git a/skills/flashduty/scripts/incident-summary.sh b/skills/flashduty/scripts/incident-summary.sh
index 8288100..f9a52ac 100644
--- a/skills/flashduty/scripts/incident-summary.sh
+++ b/skills/flashduty/scripts/incident-summary.sh
@@ -7,8 +7,9 @@
 #
 # usage: bash incident-summary.sh
 #
-# To tie post-mortems to this incident specifically, re-run the last section with the
-# channel_id from "incident detail": fduty incident post-mortem-list --channel-ids
+# Section ⑥ lists recent post-mortems account-wide. To scope them to THIS incident's
+# channel, read its channel_id (fduty incident info --incident-id --output-format
+# toon | grep '^channel_id:') and re-run: fduty incident post-mortem-list --channel-ids
 #
 # Note: errexit (-e) is intentionally NOT set — every section must run even if one
 # command fails, so the summary stays as complete as possible. Each command's own
@@ -21,9 +22,14 @@ if [ -z "$ID" ]; then
   exit 2
 fi
 
-run() { echo "===== fduty $* ====="; fduty "$@" --output-format toon 2>&1; echo; }
+# Print each command's DEFAULT renderer (a curated table/summary that projects the
+# summary-relevant fields), NOT --output-format toon: toon dumps the full raw objects
+# — every empty field plus heavy blobs like a change's labels.steps — which overflowed
+# the output cap and forced repeated paging. For these read verbs the lean default IS
+# the field projection a fault summary needs (id/severity/status/title/channel/times/…).
+run() { echo "===== fduty $* ====="; fduty "$@" 2>&1; echo; }
 
-run incident detail "$ID"                 # ① 详情 + AI summary + alert counts + channel_id
+run incident detail "$ID"                 # ① 详情 + AI summary + alert counts + channel
 run incident alerts "$ID"                 # ② contributing alerts
 run incident timeline "$ID"               # ④ timeline
 run incident similar "$ID" --limit 5      # ⑤ similar past incidents (channel-backed)