
[fm] Add saga diagnosis engine #10592

Draft
smklein wants to merge 14 commits into fm-disk-diagnoser-typed from fm-saga-diagnoser

@smklein smklein commented Jun 10, 2026

Collaborator

Adds the second fault management diagnosis engine: the problematic-saga diagnoser. Stacked on #10541, which establishes the typed per-engine fact tables this engine uses.

What it does

Opens a case (keyed by saga ID) for any unfinished saga that needs attention.

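This PR uses three fact kinds:

- NotProgressing: a running or unwinding saga has recorded no node event within STALE_SAGA_THRESHOLD.
- OwnerNotCurrentGeneration: the saga is owned by a Nexus that is no longer of the current generation.
- Abandoned: Nexus has abandoned the saga; this fact supersedes the two live-saga facts.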

Inputs

The preparation phase builds an ObservedSaga projection, the saga analog of #10541's InServiceDisk: all unfinished sagas (running, unwinding, or abandoned), joined with each saga's latest node-event time and its owner's classification against db_metadata_nexus. This is the executed view read directly from the database, and it is included in the analysis input report so "why did FM (not) flag this saga" is answerable from a sitrep.
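
A rough sketch of the projection's shape, for orientation only. It is not the nexus-types definition: the field names, the `SagaRow` stand-in, the `Current`/`NotCurrent` variants, and the reading of `Absent` as "owner has no db_metadata_nexus record" are all assumptions, and IDs and times are plain integers to keep the example dependency-free.

```rust
use std::collections::{HashMap, HashSet};

/// Unfinished saga states; a "done" saga is never observed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ObservedSagaState {
    Running,
    Unwinding,
    Abandoned,
}

/// Assumed classification of a saga's owner against db_metadata_nexus.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OwnerState {
    Current,
    NotCurrent,
    /// Assumption: the owner has no db_metadata_nexus record at all.
    Absent,
}

/// Hypothetical stand-in for a saga row.
struct SagaRow {
    id: u64,
    state: ObservedSagaState,
    current_sec: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
struct ObservedSaga {
    saga_id: u64,
    state: ObservedSagaState,
    latest_node_event_secs: Option<u64>,
    /// None when the saga has no current_sec (not the same as Absent).
    owner_state: Option<OwnerState>,
}

/// Join each unfinished saga with its latest node-event time and its
/// owner's classification.
fn build_projection(
    sagas: &[SagaRow],
    latest_events: &HashMap<u64, u64>,
    current_nexus: &HashSet<u64>,
    known_nexus: &HashSet<u64>,
) -> Vec<ObservedSaga> {
    sagas
        .iter()
        .map(|s| ObservedSaga {
            saga_id: s.id,
            state: s.state,
            latest_node_event_secs: latest_events.get(&s.id).copied(),
            owner_state: s.current_sec.map(|sec| {
                if current_nexus.contains(&sec) {
                    OwnerState::Current
                } else if known_nexus.contains(&sec) {
                    OwnerState::NotCurrent
                } else {
                    OwnerState::Absent
                }
            }),
        })
        .collect()
}
```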

Case lifecycle

The two live-saga facts may coexist on one case; as conditions change, facts are removed and re-added under the same case (with stable fact UUIDs when the observation is unchanged). The case closes when the saga completes, or when every condition clears (progress resumes, owner re-adopted), and a recurrence later opens a fresh case. An abandoned saga's case never closes on its own; it persists, carrying the Abandoned fact, until remediation removes the saga row. Actually remediating an Abandoned saga seems like a case-by-case resolution, so I'm not doing it in this PR!
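
The lifecycle rules above can be summarized as a small decision function. This is an illustrative model, not the engine's code: `desired_facts`, `case_stays_open`, and the boolean inputs are hypothetical names, and `None` stands for a saga that is done or whose row has been removed.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ObservedSagaState {
    Running,
    Unwinding,
    Abandoned,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FactKind {
    NotProgressing,
    OwnerNotCurrentGeneration,
    Abandoned,
}

/// Facts a saga's case should carry after this analysis pass.
/// `state` is None once the saga is done or its row is gone.
fn desired_facts(
    state: Option<ObservedSagaState>,
    stalled: bool,
    orphaned: bool,
) -> BTreeSet<FactKind> {
    let mut facts = BTreeSet::new();
    match state {
        // Nothing left to observe: no facts, so the case closes.
        None => {}
        // Abandonment supersedes the live-saga conditions.
        Some(ObservedSagaState::Abandoned) => {
            facts.insert(FactKind::Abandoned);
        }
        // Live sagas may carry either or both live-saga facts.
        Some(ObservedSagaState::Running) | Some(ObservedSagaState::Unwinding) => {
            if stalled {
                facts.insert(FactKind::NotProgressing);
            }
            if orphaned {
                facts.insert(FactKind::OwnerNotCurrentGeneration);
            }
        }
    }
    facts
}

/// A case is an episode of a problem: it stays open only while some
/// condition still holds.
fn case_stays_open(facts: &BTreeSet<FactKind>) -> bool {
    !facts.is_empty()
}
```

With this model, a running saga whose conditions have all cleared yields an empty set (case closes), while an abandoned saga always yields exactly the Abandoned fact until its row disappears.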

Fact payload design

Payloads carry minimal condition-defining fields: the saga ID plus the parameters whose change means the condition itself changed (which rotates the fact UUID). Presentation data (e.g. the saga's name) is deliberately excluded; it can be looked up from the saga row when a case is acted on, and a case is only open while that row exists.
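
To illustrate why only condition-defining fields belong in the payload, here is a minimal sketch of "keep the fact UUID when the observation is unchanged, rotate it otherwise". The `advance` helper, the `u128` stand-in for a UUID, and the exact field names are assumptions made for the example.

```rust
/// Live-saga subset of states recorded in NotProgressing facts.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SagaProgressState {
    Running,
    Unwinding,
}

/// Condition-defining fields only: no saga name, no time_created.
#[derive(Clone, Debug, PartialEq, Eq)]
struct NotProgressing {
    saga_id: u64,
    state: SagaProgressState,
    last_event_secs: u64,
}

#[derive(Clone, Debug, PartialEq, Eq)]
struct Fact {
    /// Stand-in for the fact UUID.
    id: u128,
    payload: NotProgressing,
}

/// Carry the existing fact forward when the observation matches it;
/// otherwise mint a new fact with a fresh ID.
fn advance(
    existing: Option<&Fact>,
    observed: NotProgressing,
    mint_id: impl FnOnce() -> u128,
) -> Fact {
    match existing {
        Some(fact) if fact.payload == observed => fact.clone(),
        _ => Fact { id: mint_id(), payload: observed },
    }
}
```

Because equality is taken over the whole payload, any presentation-only field stored there would rotate the fact whenever it changed, which is the problem the adopt_generation removal avoids.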

Context

Review notes

smklein added 12 commits June 3, 2026 18:07
The second fault management diagnosis engine: opens a case (keyed by
saga_id) for any non-terminal saga that is either not making progress
(no node event recorded within STALE_SAGA_THRESHOLD) or orphaned (owned
by a Nexus that is no longer of the current generation). These are two
independent fact kinds; a saga's case may carry either or both. A case
closes when the saga reaches a terminal state. Tracks #10530.
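
For illustration, the two conditions can be sketched as independent predicates. The threshold value, the strict `>` comparison, and the function names are assumptions; the commit names STALE_SAGA_THRESHOLD but does not state its value. Times are durations since an arbitrary epoch to avoid external crates.

```rust
use std::time::Duration;

/// Assumed value, for the example only.
const STALE_SAGA_THRESHOLD: Duration = Duration::from_secs(60 * 60);

/// True when no node event has been recorded within the threshold.
/// A last_event_time in the future saturates to an age of zero.
fn is_not_progressing(now: Duration, last_event_time: Duration) -> bool {
    now.saturating_sub(last_event_time) > STALE_SAGA_THRESHOLD
}

/// True when the owning Nexus is not among the current-generation
/// Nexus IDs.
fn is_orphaned(owner: u64, current_generation: &[u64]) -> bool {
    !current_generation.contains(&owner)
}
```

The predicates share no inputs, which matches the description of the two fact kinds as independent: a saga's case may carry either or both.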

Supporting infrastructure:

- DiagnosisEngineKind::Saga variant (Rust + DB enum)
- fm_fact_saga typed child table with two fact kinds (not_progressing,
  owner_not_current_generation), per-kind nullable columns gated by a
  CHECK, participating in copy-forward + GC like other sitrep child
  tables
- SagaFact / FactPayload::Saga and the ObservedSaga nexus-types
  projection
- saga_list_running_or_unwinding_batched and a grouped
  saga_latest_node_event_times datastore query (the wall-clock progress
  signal); owner currency read DB-direct via the existing
  get_db_metadata_nexus_in_state

Schema migration: fm-saga-de (version 263) adds the 'saga' enum value,
the fm_fact_saga_kind and fm_fact_saga_orphan_reason enums, and the
fm_fact_saga table.

Match the fm_fact_physical_disk convention: each kind's constraint
validates that the columns it expects are present, and future kinds add
their own constraint instead of rewriting an exhaustive CASE. This also
stops enforcing that other kinds' columns are NULL, leaving room for
kinds to share columns.

A case is an episode of a problem, not a dossier on the saga. When a
flagged saga is still running but no condition holds anymore (progress
resumed, owner re-adopted by a current Nexus), close the case rather
than stripping its facts and leaving it open. Previously the case would
be left open with zero facts, making it uninterpretable to the next
analysis pass, which would then carry it forward unexamined forever,
even after the saga terminated.

The closing case keeps its facts attached as the record of why it
existed; they age out with the case once it stops being copied forward
and its sitreps are GC'd.

Facts are only read during sitrep load, which paginates over the
(sitrep_id, id) primary key; nothing queries by case directly. Matches
the same change to fm_fact_physical_disk.

The input report listed in-service disks but said nothing about the
sagas visible to the saga diagnosis engine, leaving no way to answer
"why did FM (not) flag this saga" from the report. List each
non-terminal saga with its name, state, latest node-event time, and
owner classification.

A mass-stuck-saga incident is exactly when this query runs with a large
ID list; chunk the eq_any so it never becomes one giant statement.
Chunks are disjoint, so GROUP BY keys never cross them and the
concatenated results are identical to the single-statement form.
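
A small model of the chunking argument, using an in-memory stand-in for the database. `latest_event_times_chunked`, `query_latest`, and the chunk size are hypothetical; the real query uses eq_any against the datastore. The sketch assumes the ID list contains no duplicates, so each saga ID falls in exactly one chunk and merging per-chunk results loses nothing.

```rust
use std::collections::HashMap;

/// Stand-in for the grouped query: latest event time per saga, limited
/// to the given IDs (the equivalent of `WHERE saga_id = ANY(ids)`).
fn query_latest(events: &[(u64, u64)], ids: &[u64]) -> Vec<(u64, u64)> {
    let mut latest: HashMap<u64, u64> = HashMap::new();
    for &(saga_id, time) in events {
        if ids.contains(&saga_id) {
            let entry = latest.entry(saga_id).or_insert(time);
            if time > *entry {
                *entry = time;
            }
        }
    }
    latest.into_iter().collect()
}

/// Issue one query per chunk of IDs and merge the results.
/// `chunk_size` must be nonzero (`slice::chunks` panics on zero).
fn latest_event_times_chunked<F>(
    ids: &[u64],
    chunk_size: usize,
    mut query: F,
) -> HashMap<u64, u64>
where
    F: FnMut(&[u64]) -> Vec<(u64, u64)>,
{
    let mut merged = HashMap::new();
    for chunk in ids.chunks(chunk_size) {
        // Chunks are disjoint, so no key is ever overwritten here.
        merged.extend(query(chunk));
    }
    merged
}
```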

A saga case carries at most one fact per kind. The parent-case summary
previously kept whichever duplicate happened to have the highest fact
UUID and lost track of the rest, so a (pathological) duplicate would be
carried forward and re-persisted in every sitrep, and removing the
tracked copy while an untracked match survived would re-add a fresh
fact, regenerating the duplicate pair.

ParentSagaCase now separates the fact to consider when advancing the
case (the lowest fact UUID of each kind) from duplicates, which are
removed unconditionally with a warning. A corrupt case converges to one
fact per kind in a single pass.
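
The separation described above might look roughly like this. It is a sketch, not ParentSagaCase itself: `split_duplicates` and the `Fact` shape are hypothetical, and a `u128` stands in for the fact UUID.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FactKind {
    NotProgressing,
    OwnerNotCurrentGeneration,
    Abandoned,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fact {
    /// Stand-in for the fact UUID.
    id: u128,
    kind: FactKind,
}

/// Split a case's facts into the one fact per kind to consider when
/// advancing the case (lowest ID wins) and the duplicates to remove.
fn split_duplicates(facts: &[Fact]) -> (BTreeMap<FactKind, Fact>, Vec<Fact>) {
    let mut sorted = facts.to_vec();
    sorted.sort_by_key(|fact| fact.id);

    let mut tracked = BTreeMap::new();
    let mut duplicates = Vec::new();
    for fact in sorted {
        if tracked.contains_key(&fact.kind) {
            // A lower-ID fact of this kind is already tracked.
            duplicates.push(fact);
        } else {
            tracked.insert(fact.kind, fact);
        }
    }
    (tracked, duplicates)
}
```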

A saga with no current_sec yields owner_state = None, not Absent; the
variant doc claimed otherwise. Also replace em-dashes in this file's
comments.

A fact payload contains exactly what defines the condition for the
analysis loop: the subject's ID plus the parameters whose change should
rotate the fact. Anything a human wants for presentation is looked up
from the database when a case is acted on; a case is only open while
its saga row still exists, so the lookup always works.

Accordingly:
- saga_name leaves both payloads and the fm_fact_saga table; it never
  defines a condition. Names still appear in debug comments.
- time_created leaves NotProgressing; the staleness condition folds it
  into last_event_time already.
- adopt_generation leaves OwnerNotCurrentGeneration; it is not
  condition-defining, and sagas_reassign_sec bumping it would have
  rotated the fact UUID over a meaningless ownership shuffle.

Per omicron#10581, Nexus will explicitly abandon sagas it fails to
recover for non-transient reasons. Abandonment is the beginning of an
escalation, not a resolution: the saga may be holding
partially-allocated resources and needs saga-specific manual
remediation (RFD 555). Without this change the engine would close the
case the moment a stuck saga was abandoned, exactly when the escalation
should begin.

- The projection now lists unfinished sagas (running, unwinding, or
  abandoned); only 'done' drops a saga from observation. ObservedSaga
  carries the three-variant ObservedSagaState; SagaProgressState
  remains the live-saga subset recorded in NotProgressing facts.
- New SagaFact::Abandoned with a pure-identity payload (the condition
  is boolean, so nothing can rotate it). Abandonment supersedes the
  live-saga conditions: NotProgressing and OwnerNotCurrentGeneration
  facts are removed when a saga is abandoned, and the case carries the
  Abandoned fact alone, staying open until the saga row is deleted.
- The module doc also records why the owner fact detects stranded
  sagas rather than wrongly-resumed ones, and what mitigates the
  latter.
smklein changed the title from "Fm saga diagnoser" to "[fm] Add saga diagnosis engine" on Jun 11, 2026
smklein added 2 commits June 11, 2026 15:35
Apply the lessons from the disk diagnoser review (the same pattern
landed there in 6f2e4a5):

- Close cases the engine cannot interpret (foreign fact payload, facts
  disagreeing on the saga, no facts) instead of skipping them, which
  left them open and unprocessable in every future sitrep with no path
  to closure. Reasons are a typed UninterpretableCase enum surfaced in
  the close comment and warn logs. Closing is safe for fault coverage:
  detection iterates observed sagas independently of case bookkeeping,
  so a saga that still needs attention gets a fresh well-formed case in
  the same pass.

- Close duplicate cases for the same saga as superseded, keeping the
  lowest case ID. Previously the last-indexed case silently won the
  saga_id index while both cases had stale facts removed, so the loser
  decayed into an empty open case that could never close.

- New tests: empty case closed, duplicate closed, corrupt case replaced
  by a fresh one in the same pass, and foreign-payload cases closed in
  both engines (newly testable, since two FactPayload variants now
  exist).
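
A compact sketch of the two closing rules, under stated assumptions: UninterpretableCase is named in the commit, but its variant names here, the `CloseReason` type, the `Case` shape, and the `triage` function are hypothetical, and `u128` stands in for a case ID.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum UninterpretableCase {
    ForeignFactPayload,
    FactsDisagreeOnSaga,
    NoFacts,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CloseReason {
    Uninterpretable(UninterpretableCase),
    /// Carries the ID of the surviving case.
    SupersededBy(u128),
}

#[derive(Clone, Copy)]
enum FactPayload {
    Saga { saga_id: u64 },
    PhysicalDisk,
}

struct Case {
    id: u128,
    facts: Vec<FactPayload>,
}

/// Which saga a case is about, or why the engine cannot tell.
fn saga_of(case: &Case) -> Result<u64, UninterpretableCase> {
    let mut saga = None;
    for fact in &case.facts {
        match *fact {
            FactPayload::PhysicalDisk => {
                return Err(UninterpretableCase::ForeignFactPayload)
            }
            FactPayload::Saga { saga_id } => {
                if let Some(previous) = saga {
                    if previous != saga_id {
                        return Err(UninterpretableCase::FactsDisagreeOnSaga);
                    }
                }
                saga = Some(saga_id);
            }
        }
    }
    saga.ok_or(UninterpretableCase::NoFacts)
}

/// Build the saga_id -> case index, closing uninterpretable cases and
/// duplicates (the lowest case ID for a saga survives).
fn triage(cases: &[Case]) -> (BTreeMap<u64, u128>, Vec<(u128, CloseReason)>) {
    let mut ordered: Vec<&Case> = cases.iter().collect();
    ordered.sort_by_key(|case| case.id);

    let mut index = BTreeMap::new();
    let mut closed = Vec::new();
    for case in ordered {
        match saga_of(case) {
            Err(why) => closed.push((case.id, CloseReason::Uninterpretable(why))),
            Ok(saga_id) => match index.get(&saga_id).copied() {
                Some(winner) => {
                    closed.push((case.id, CloseReason::SupersededBy(winner)))
                }
                None => {
                    index.insert(saga_id, case.id);
                }
            },
        }
    }
    (index, closed)
}
```

Detection would then run over observed sagas and consult only `index`, so a saga whose sole case was closed as uninterpretable can still get a fresh case in the same pass.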