[fm] Add saga diagnosis engine #10592
Draft
smklein wants to merge 14 commits into
The second fault management diagnosis engine: opens a case (keyed by saga_id) for any non-terminal saga that is either not making progress (no node event recorded within STALE_SAGA_THRESHOLD) or orphaned (owned by a Nexus that is no longer of the current generation). These are two independent fact kinds; a saga's case may carry either or both. A case closes when the saga reaches a terminal state. Tracks #10530.
Supporting infrastructure:
- DiagnosisEngineKind::Saga variant (Rust + DB enum)
- fm_fact_saga typed child table with two fact kinds (not_progressing, owner_not_current_generation), per-kind nullable columns gated by a CHECK, participating in copy-forward + GC like other sitrep child tables
- SagaFact / FactPayload::Saga and the ObservedSaga nexus-types projection
- saga_list_running_or_unwinding_batched and a grouped saga_latest_node_event_times datastore query (the wall-clock progress signal); owner currency read DB-direct via the existing get_db_metadata_nexus_in_state
Schema migration: fm-saga-de (version 263) adds the 'saga' enum value, the fm_fact_saga_kind and fm_fact_saga_orphan_reason enums, and the fm_fact_saga table.
Match the fm_fact_physical_disk convention: each kind's constraint validates that the columns it expects are present, and future kinds add their own constraint instead of rewriting an exhaustive CASE. This also stops enforcing that other kinds' columns are NULL, leaving room for kinds to share columns.
A case is an episode of a problem, not a dossier on the saga. When a flagged saga is still running but no condition holds anymore (progress resumed, owner re-adopted by a current Nexus), close the case rather than stripping its facts and leaving it open. Previously the case would be left open with zero facts, making it uninterpretable to the next analysis pass, which would then carry it forward unexamined forever, even after the saga terminated. The closing case keeps its facts attached as the record of why it existed; they age out with the case once it stops being copied forward and its sitreps are GC'd.
Facts are only read during sitrep load, which paginates over the (sitrep_id, id) primary key; nothing queries by case directly. Matches the same change to fm_fact_physical_disk.
The input report listed in-service disks but said nothing about the sagas visible to the saga diagnosis engine, leaving no way to answer "why did FM (not) flag this saga" from the report. List each non-terminal saga with its name, state, latest node-event time, and owner classification.
A mass-stuck-saga incident is exactly when this query runs with a large ID list; chunk the eq_any so it never becomes one giant statement. Chunks are disjoint, so GROUP BY keys never cross them and the concatenated results are identical to the single-statement form.
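The equivalence claimed above can be illustrated with a minimal in-memory sketch. Everything in it is a stand-in: `latest_event_times` plays the role of one grouped statement over an ID list, the `u32`/`u64` aliases replace real saga IDs and timestamps, and the function names and chunk size are hypothetical. It is not the datastore's actual code.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for saga UUIDs and event timestamps.
type SagaId = u32;
type Timestamp = u64;

/// Models one grouped statement: for each requested saga, the latest
/// event time (a MAX grouped by saga ID over the rows matching `ids`).
fn latest_event_times(
    events: &[(SagaId, Timestamp)],
    ids: &[SagaId],
) -> BTreeMap<SagaId, Timestamp> {
    let mut latest: BTreeMap<SagaId, Timestamp> = BTreeMap::new();
    for &(saga_id, time) in events {
        if ids.contains(&saga_id) {
            let entry = latest.entry(saga_id).or_insert(time);
            if time > *entry {
                *entry = time;
            }
        }
    }
    latest
}

/// Runs the same query once per chunk of IDs and concatenates the results.
/// Assuming `ids` has no duplicates, each saga ID lands in exactly one
/// chunk, so no group is ever split across two statements.
fn latest_event_times_chunked(
    events: &[(SagaId, Timestamp)],
    ids: &[SagaId],
    chunk_size: usize,
) -> BTreeMap<SagaId, Timestamp> {
    let mut latest: BTreeMap<SagaId, Timestamp> = BTreeMap::new();
    // `chunks` panics on a size of zero, so clamp to at least one.
    for chunk in ids.chunks(chunk_size.max(1)) {
        latest.extend(latest_event_times(events, chunk));
    }
    latest
}
```

Comparing the two functions on the same inputs gives equal maps for any chunk size, which is the property the commit relies on.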
A saga case carries at most one fact per kind. The parent-case summary previously kept whichever duplicate happened to have the highest fact UUID and lost track of the rest, so a (pathological) duplicate would be carried forward and re-persisted in every sitrep, and removing the tracked copy while an untracked match survived would re-add a fresh fact, regenerating the duplicate pair. ParentSagaCase now separates the fact to consider when advancing the case (the lowest fact UUID of each kind) from duplicates, which are removed unconditionally with a warning. A corrupt case converges to one fact per kind in a single pass.
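The split between the fact to consider and its duplicates might look roughly like the sketch below. The `Fact` struct, the `u128` ID standing in for a fact UUID, and the `split_duplicates` name are all assumptions for illustration; the real ParentSagaCase is not shown in this PR page.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FactKind {
    NotProgressing,
    OwnerNotCurrentGeneration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Fact {
    id: u128, // hypothetical stand-in for the fact UUID
    kind: FactKind,
}

/// Returns the fact to consider for each kind (the lowest ID of that kind)
/// and, separately, every other fact of an already-seen kind.
fn split_duplicates(facts: &[Fact]) -> (BTreeMap<FactKind, Fact>, Vec<Fact>) {
    // Visit facts in ascending ID order so the first of each kind wins.
    let mut sorted = facts.to_vec();
    sorted.sort_by_key(|fact| fact.id);

    let mut kept: BTreeMap<FactKind, Fact> = BTreeMap::new();
    let mut duplicates: Vec<Fact> = Vec::new();
    for fact in sorted {
        if kept.contains_key(&fact.kind) {
            // A fact of this kind is already tracked: mark for removal.
            duplicates.push(fact);
        } else {
            kept.insert(fact.kind, fact);
        }
    }
    (kept, duplicates)
}
```

Because the choice depends only on the fact IDs, the same fact survives regardless of input order, and removing everything in `duplicates` leaves one fact per kind after a single pass.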
A saga with no current_sec yields owner_state = None, not Absent; the variant doc claimed otherwise. Also replace em-dashes in this file's comments.
A fact payload contains exactly what defines the condition for the analysis loop: the subject's ID plus the parameters whose change should rotate the fact. Anything a human wants for presentation is looked up from the database when a case is acted on; a case is only open while its saga row still exists, so the lookup always works. Accordingly:
- saga_name leaves both payloads and the fm_fact_saga table; it never defines a condition. Names still appear in debug comments.
- time_created leaves NotProgressing; the staleness condition folds it into last_event_time already.
- adopt_generation leaves OwnerNotCurrentGeneration; it is not condition-defining, and sagas_reassign_sec bumping it would have rotated the fact UUID over a meaningless ownership shuffle.
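One way to picture "parameters whose change should rotate the fact" is a carry-forward rule that compares payloads. The sketch below is an assumption throughout: the payload fields are guesses at what remains after this commit, the integer IDs stand in for UUIDs, and `carry_forward` is a hypothetical helper, not the engine's API.

```rust
// Hypothetical payloads holding only condition-defining fields.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SagaFactPayload {
    NotProgressing { saga_id: u32, last_event_time: u64 },
    OwnerNotCurrentGeneration { saga_id: u32, owner: u32 },
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Fact {
    id: u64, // stand-in for the fact UUID
    payload: SagaFactPayload,
}

/// Keeps the previous fact (and its ID) when the observed payload is
/// identical; otherwise the condition changed and a fresh ID is issued.
fn carry_forward(
    previous: Option<&Fact>,
    observed: SagaFactPayload,
    fresh_id: u64,
) -> Fact {
    match previous {
        Some(prev) if prev.payload == observed => prev.clone(),
        _ => Fact { id: fresh_id, payload: observed },
    }
}
```

Under this rule, any field placed in the payload becomes a rotation trigger, which is why a field like adopt_generation does not belong there: bumping it would issue a new ID even though the condition is the same.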
Per omicron#10581, Nexus will explicitly abandon sagas it fails to recover for non-transient reasons. Abandonment is the beginning of an escalation, not a resolution: the saga may be holding partially-allocated resources and needs saga-specific manual remediation (RFD 555). Without this change the engine would close the case the moment a stuck saga was abandoned, exactly when the escalation should begin.
- The projection now lists unfinished sagas (running, unwinding, or abandoned); only 'done' drops a saga from observation. ObservedSaga carries the three-variant ObservedSagaState; SagaProgressState remains the live-saga subset recorded in NotProgressing facts.
- New SagaFact::Abandoned with a pure-identity payload (the condition is boolean, so nothing can rotate it). Abandonment supersedes the live-saga conditions: NotProgressing and OwnerNotCurrentGeneration facts are removed when a saga is abandoned, and the case carries the Abandoned fact alone, staying open until the saga row is deleted.
- The module doc also records why the owner fact detects stranded sagas rather than wrongly-resumed ones, and what mitigates the latter.
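The resulting lifecycle rules (close when the saga is no longer observed or when no condition holds, and let abandonment supersede the live-saga facts) can be condensed into a small decision function. This is a sketch under stated assumptions: `Observation`, `CaseAction`, and `advance_case` are hypothetical names, and the two booleans stand in for the real staleness and owner checks.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ObservedSagaState {
    Running,
    Unwinding,
    Abandoned,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FactKind {
    NotProgressing,
    OwnerNotCurrentGeneration,
    Abandoned,
}

#[derive(Debug, PartialEq, Eq)]
enum CaseAction {
    /// Keep the case open carrying exactly these fact kinds.
    KeepOpen(Vec<FactKind>),
    Close,
}

// Hypothetical, pre-computed view of one unfinished saga.
struct Observation {
    state: ObservedSagaState,
    stale: bool,
    owner_not_current: bool,
}

/// `None` means the saga is no longer observed (done, or its row is gone).
fn advance_case(observed: Option<&Observation>) -> CaseAction {
    let obs = match observed {
        Some(obs) => obs,
        None => return CaseAction::Close,
    };
    // Abandonment supersedes the live-saga conditions.
    if obs.state == ObservedSagaState::Abandoned {
        return CaseAction::KeepOpen(vec![FactKind::Abandoned]);
    }
    let mut facts = Vec::new();
    if obs.stale {
        facts.push(FactKind::NotProgressing);
    }
    if obs.owner_not_current {
        facts.push(FactKind::OwnerNotCurrentGeneration);
    }
    // A live saga with no remaining condition closes its case.
    if facts.is_empty() {
        CaseAction::Close
    } else {
        CaseAction::KeepOpen(facts)
    }
}
```

Note that an abandoned saga never reaches the "no condition holds" branch, so its case can only close once the saga stops being observed.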
Apply the lessons from the disk diagnoser review (the same pattern landed there in 6f2e4a5):
- Close cases the engine cannot interpret (foreign fact payload, facts disagreeing on the saga, no facts) instead of skipping them, which left them open and unprocessable in every future sitrep with no path to closure. Reasons are a typed UninterpretableCase enum surfaced in the close comment and warn logs. Closing is safe for fault coverage: detection iterates observed sagas independently of case bookkeeping, so a saga that still needs attention gets a fresh well-formed case in the same pass.
- Close duplicate cases for the same saga as superseded, keeping the lowest case ID. Previously the last-indexed case silently won the saga_id index while both cases had stale facts removed, so the loser decayed into an empty open case that could never close.
- New tests: empty case closed, duplicate closed, corrupt case replaced by a fresh one in the same pass, and foreign-payload cases closed in both engines (newly testable, since two FactPayload variants now exist).
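The three uninterpretable conditions named above can be sketched as a single classification function. The variant names, the `PhysicalDisk` payload variant, and the integer IDs are assumptions for illustration; only the existence of a typed UninterpretableCase enum and of FactPayload::Saga comes from the commit message.

```rust
// Hypothetical variant names for the three reasons listed above.
#[derive(Debug, PartialEq, Eq)]
enum UninterpretableCase {
    NoFacts,
    ForeignPayload,
    SagaMismatch,
}

// Simplified payloads; the non-saga variant name is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FactPayload {
    Saga { saga_id: u32 },
    PhysicalDisk { disk_id: u32 },
}

/// Returns the saga a case is about, or the reason the case should be
/// closed as uninterpretable.
fn case_saga_id(facts: &[FactPayload]) -> Result<u32, UninterpretableCase> {
    let mut saga_id: Option<u32> = None;
    for fact in facts {
        match *fact {
            FactPayload::Saga { saga_id: id } => match saga_id {
                None => saga_id = Some(id),
                // Two facts on one case name different sagas.
                Some(prev) if prev != id => {
                    return Err(UninterpretableCase::SagaMismatch)
                }
                Some(_) => {}
            },
            // A payload belonging to another diagnosis engine.
            FactPayload::PhysicalDisk { .. } => {
                return Err(UninterpretableCase::ForeignPayload)
            }
        }
    }
    saga_id.ok_or(UninterpretableCase::NoFacts)
}
```

An `Err` here maps to closing the case with the reason in the close comment; since detection walks observed sagas independently, a saga that still has a problem would get a new case in the same pass.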
Adds a fault management diagnosis engine: the problematic saga diagnoser. Stacked on #10541, which establishes the typed per-engine fact tables this engine uses.
What it does
Opens a case (keyed by saga ID) for any unfinished saga that needs attention.
This PR uses three fact kinds:
- NotProgressing: a live (running or unwinding) saga has recorded no node event for longer than STALE_SAGA_THRESHOLD.
- OwnerNotCurrentGeneration: the saga's current_sec is a Nexus that will never advance it, either quiesced (an older generation that handed off) or expunged (no db_metadata_nexus record). Note this detects stranded sagas, not wrongly-resumed ones; the latter risk should be mitigated by #10581 (Nexus should explicitly abandon sagas that it fails to recover for non-transient reasons) and saga quiesce.
- Abandoned: Nexus permanently gave up on recovering the saga (#10581). Abandonment is a signal to begin an escalation: the saga may be holding partially-allocated resources and needs saga-specific manual remediation.
Inputs
The preparation phase builds an ObservedSaga projection, the saga analog of #10541's InServiceDisk: all unfinished sagas (running, unwinding, or abandoned), joined with each saga's latest node-event time and its owner's classification against db_metadata_nexus. This is the executed view read directly from the database, and it is included in the analysis input report so "why did FM (not) flag this saga" is answerable from a sitrep.
Case lifecycle
The two live-saga facts may coexist on one case; as conditions change, facts are removed and re-added under the same case (with stable fact UUIDs when the observation is unchanged). The case closes when the saga completes, or when every condition clears (progress resumes, owner re-adopted), and a recurrence later opens a fresh case. An abandoned saga's case never closes on its own; it persists, carrying the Abandoned fact, until remediation removes the saga row. Actually remediating an Abandoned saga seems like a case-by-case resolution, so I'm not doing it in this PR!
Fact payload design
Payloads carry minimal condition-defining fields: the saga ID plus the parameters whose change means the condition itself changed (which rotates the fact UUID). Presentation data (e.g. the saga's name) is deliberately excluded; it can be looked up from the saga row when a case is acted on, and a case is only open while that row exists.
Context
The Abandoned fact kind exists so that landing #10581 (Nexus should explicitly abandon sagas that it fails to recover for non-transient reasons) escalates these cases instead of silently closing them.
Review notes
STALE_SAGA_THRESHOLD is 30 minutes and deliberately easy to change; the field evidence from #10531 (health check reported hung sagas that aren't actually running) says the real class of problem manifests at days scale, while the slowest legitimate saga node bounds it from below. Happy to tune this in review.
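For reviewers weighing the threshold, the staleness test itself can be pictured as a simple comparison. This is a minimal sketch, not the engine's code: the function name is hypothetical, the strict "longer than" comparison follows the wording of the NotProgressing description, and treating a future-dated event as not stale is an assumption about how clock skew would be handled.

```rust
use std::time::{Duration, SystemTime};

// 30 minutes, per the review note; a single constant keeps it easy to tune.
const STALE_SAGA_THRESHOLD: Duration = Duration::from_secs(30 * 60);

/// True when the saga's latest node event is older than the threshold.
fn is_not_progressing(now: SystemTime, last_event_time: SystemTime) -> bool {
    match now.duration_since(last_event_time) {
        Ok(elapsed) => elapsed > STALE_SAGA_THRESHOLD,
        // The event is timestamped after `now` (assumed clock skew):
        // treat the saga as having made recent progress.
        Err(_) => false,
    }
}
```

With this shape, a saga whose last event is exactly 30 minutes old is not yet flagged; only one second later does the condition hold.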