refactor: commit workflow-registry transitions atomically by kaiitunnz · Pull Request #72 · mlsys-io/FlowMesh

kaiitunnz · 2026-06-13T09:15:17Z

Purpose

TaskRuntime persisted each task transition as several independent Redis writes — the task records via save_task_states, the workflow status-set membership via a per-status mark_task_* verb, and the schedule snapshot via save_workflow_sched — each opening its own transaction. A crash between any two left durable state half-applied (e.g. a task in the cancelled set whose persisted record still says PENDING, so rehydrate re-runs it). The API-driven workflow cancel was the sharpest case: unlike the event-driven transitions, it has no durable event to replay, so nothing heals a partial write. This was raised in review on #64.

Changes

Atomic primitive — src/server/registries/workflow.py: add WorkflowRegistry.commit_transition, which takes a transition delta as explicit keyword params (record upserts, per-status membership moves — dispatched / pending / done / failed / cancelled — and an optional WorkflowSched) and applies the whole delta in a single control_pipeline() MULTI/EXEC. Remove the per-status verbs (mark_task_dispatched/pending/done/failed/cancelled, sync + async) and the sync save_task_states / save_workflow_sched they were paired with.
Runtime migration — src/server/task/runtime.py: every transition (dispatch, requeue/mark_pending, terminal mark_succeeded / mark_failed / mark_cancelled, cancel_workflow, and the replay re-persist) now commits one commit_transition from its already-applied in-memory state as the single last step. _persist_terminal_locked groups terminal tasks by workflow, folds the schedule snapshot in, warns on (and skips) any non-terminal task, and skips the commit entirely when a workflow has no terminal task to move; cancel_workflow commits cancelling records + cancelled membership + schedule in one transaction.
Tests — the three registry doubles move to a single commit_transition; the five persist-failure tests now fault-inject on that one choke point; two new tests assert the cancel guarantee directly (test_cancel_workflow_commits_atomically_on_crash, test_rehydrate_restores_cancelled_workflow).
Docs — docs/SERVICE_RESTARTS.md: document the atomic-transition contract and why cancel relies on it alone.
Dead-code prune (separate commit) — remove the long-unused update_workflow(_async), delete_task_states(_async), and sync load_workflow_sched from the registry.
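As a rough illustration of the atomic primitive described above, the sketch below applies record upserts, status-set moves, and an optional schedule snapshot through one buffered transaction. It is not the code from src/server/registries/workflow.py: FakeStore and FakePipeline are hypothetical in-memory stand-ins for the control Redis and its MULTI/EXEC pipeline, the wf:... key names are invented, and removing a task from every other status set is an assumption about what the old per-status verbs did.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Iterable, Mapping

# Status-set names taken from the PR description; the key layout below is assumed.
STATUSES = ("dispatched", "pending", "done", "failed", "cancelled")


class FakeStore:
    """Hypothetical in-memory stand-in for the control Redis."""

    def __init__(self) -> None:
        self.hashes: dict[str, dict[str, str]] = defaultdict(dict)
        self.sets: dict[str, set[str]] = defaultdict(set)

    def hset(self, key: str, field: str, value: str) -> None:
        self.hashes[key][field] = value

    def sadd(self, key: str, member: str) -> None:
        self.sets[key].add(member)

    def srem(self, key: str, member: str) -> None:
        self.sets[key].discard(member)


class FakePipeline:
    """Buffers commands and applies them only on execute(), like MULTI/EXEC."""

    def __init__(self, store: FakeStore) -> None:
        self._store = store
        self._ops: list[tuple[str, tuple[Any, ...]]] = []

    def queue(self, name: str, *args: Any) -> None:
        self._ops.append((name, args))

    def execute(self) -> None:
        for name, args in self._ops:
            getattr(self._store, name)(*args)
        self._ops.clear()


def commit_transition(
    store: FakeStore,
    workflow_id: str,
    *,
    records: Mapping[str, str] | None = None,
    dispatched: Iterable[str] = (),
    pending: Iterable[str] = (),
    done: Iterable[str] = (),
    failed: Iterable[str] = (),
    cancelled: Iterable[str] = (),
    sched: str | None = None,
) -> None:
    """Queue the whole delta, then apply it in one execute()."""
    moves = dict(zip(STATUSES, (dispatched, pending, done, failed, cancelled)))
    pipe = FakePipeline(store)
    for task_id, payload in (records or {}).items():
        pipe.queue("hset", f"wf:{workflow_id}:tasks", task_id, payload)
    for status, task_ids in moves.items():
        for task_id in task_ids:
            # Assumed move semantics: leave every other set, join the target set.
            for other in STATUSES:
                if other != status:
                    pipe.queue("srem", f"wf:{workflow_id}:{other}", task_id)
            pipe.queue("sadd", f"wf:{workflow_id}:{status}", task_id)
    if sched is not None:
        pipe.queue("hset", f"wf:{workflow_id}:meta", "sched", sched)
    pipe.execute()  # nothing reaches the store before this line
```

Because every write is queued first, a failure anywhere before execute() leaves the store exactly as it was, which is the property the cancel path relies on.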

Design

The durable shape of every transition is the same (upsert some records, move some tasks between status sets, maybe snapshot the schedule), so commit_transition takes that delta as explicit params and commits it atomically, rather than scattering it across per-status verbs or a one-off persist_cancellation helper. All affected keys live on the control Redis, so a single MULTI/EXEC covers them: the transition commits in full or not at all, and a crash mid-persist can never leave a half-applied state. Event-driven transitions keep their at-least-once replay backstop; the cancel path, which has none, now relies on this atomicity. The "persist last" invariant is preserved: in-memory mutations all happen before the single commit, so a failed write still leaves the in-memory state fully applied and replay heals as before. Behavior is otherwise unchanged: each status group reproduces the exact srem/sadd of the old verb, mark_cancelled still writes no schedule, and updated_at bumps on the same transitions as before (plus the schedule-only commits, where the durable state did change).
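The "persist last" invariant and the all-or-nothing commit can be sketched in a few lines. This is a toy model, not TaskRuntime: BufferedTxn, CrashBeforeExec, cancel_task, and the record:/member: keys are all hypothetical names chosen for illustration, and the durable store is a plain dict.

```python
from __future__ import annotations


class CrashBeforeExec(RuntimeError):
    """Simulates the process dying before the transaction is executed."""


class BufferedTxn:
    """Hypothetical transaction: writes are buffered until execute()."""

    def __init__(self, durable: dict[str, str], *, fail: bool = False) -> None:
        self._durable = durable
        self._fail = fail
        self._ops: list[tuple[str, str]] = []

    def set(self, key: str, value: str) -> None:
        self._ops.append((key, value))

    def execute(self) -> None:
        if self._fail:
            raise CrashBeforeExec("simulated crash mid-persist")
        self._durable.update(self._ops)


def cancel_task(
    memory: dict[str, str],
    durable: dict[str, str],
    task_id: str,
    *,
    fail: bool = False,
) -> None:
    # 1. Apply the in-memory mutation first.
    memory[task_id] = "CANCELLED"
    # 2. Commit the record and the membership together as the single last step.
    txn = BufferedTxn(durable, fail=fail)
    txn.set(f"record:{task_id}", "CANCELLED")
    txn.set(f"member:{task_id}", "cancelled")
    txn.execute()
```

If the commit fails, the in-memory state is already fully applied while the durable state is untouched, so the record and the membership can never disagree; a later retry or replay writes both at once.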

Test Plan

uv run pytest tests/server/
uv run pre-commit run --files src/server/registries/workflow.py src/server/task/runtime.py tests/server/task/test_runtime_rehydrate.py tests/server/task/test_runtime_epoch_order.py tests/server/dispatcher/helpers.py docs/SERVICE_RESTARTS.md

End-to-end against a single root + CPU worker on freshly built images carrying this branch (5 scenarios / 22 assertions). It targets what the unit suite cannot reach: the API-driven workflow cancel path, which has no event to replay and so relies on the single atomic commit alone, and the workflow status-set membership that derives the reported DONE / CANCELLED status. Scenarios: API cancel of (1) a queued, (2) an in-flight SSH sleep, and (3) a mixed PENDING+DISPATCHED workflow — each asserted to commit the whole cascade in one transaction, rehydrate CANCELLED across a flowmesh stack restart server, and never resurrect a task once a worker comes up; (4) completion deriving DONE idempotently across a restart; (5) the _persist_terminal_locked guard logging zero non-terminal warnings.

Test Result

$ uv run pytest tests/server/        # 398 passed
$ uv run pre-commit run --files ...  # isort / black / ruff / mypy / codespell / gitleaks passed

End-to-end suite: 22 passed, 0 failed (exit 0). The API-driven cancel persisted its full cascade and survived a server restart with no task resurrected in all three cancel scenarios (queued, in-flight, mixed-status); completion stayed DONE across a restart; and the terminal-persist guard logged 0 non-terminal warnings.

Follow-up

Cancel interrupts are delivered over Redis pub/sub, which has no replay. If the root restarts mid-cancel, rehydration restores the task's CANCELLING state but never re-sends the interrupt, so the cancellation can be silently dropped. The fix belongs in the worker lifecycle (re-deliver on re-register) and needs its own testing, so it is deferred to a future PR.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run and fixed any issues.
I have added or updated tests covering my changes.
I have verified that uv run pytest tests/server/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
If I changed the SDK or CLI, I have verified the affected packages work.
If this is a breaking change, I have prefixed the PR title with [BREAKING].
I have updated documentation if user-facing behavior changed.

Drop update_workflow / update_workflow_async, delete_task_states / delete_task_states_async, and the sync load_workflow_sched — none had any caller. The rehydrate path uses load_workflow_sched_async, which stays. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

Every TaskRuntime transition persisted its task records and its workflow status-set membership as separate Redis transactions, so a crash mid-persist could leave durable state half-applied. The API-driven workflow cancel was the sharpest case: unlike event-driven transitions, it has no replay to heal a partial write. Introduce WorkflowTransition and WorkflowRegistry.commit_transition, which applies a transition's record upserts, status-set membership moves, and optional schedule snapshot in one atomic control-Redis transaction. Each transition (dispatch, requeue, terminal, cancel, and replay re-persist) now builds one delta and commits it as the single last step, replacing the per-status mark_task_* verbs and the separate record / schedule writes. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

WorkflowTransition added a DTO between every caller and the registry while carrying no behavior of its own. Replace it with keyword-only params on commit_transition so each call site reads as a direct delta. While there, _persist_terminal_locked now collects only terminal tasks, warns when a non-terminal task reaches the path, and skips the commit for a workflow with no terminal move. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
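The grouping behaviour described for _persist_terminal_locked might look roughly like the sketch below. It is an assumption-laden illustration rather than the runtime's code: the Task dataclass, the group_terminal helper, the logger name, and the status strings are hypothetical.

```python
from __future__ import annotations

import logging
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable

logger = logging.getLogger("runtime_sketch")

# Assumed terminal status names; the real enum may differ.
TERMINAL = {"DONE", "FAILED", "CANCELLED"}


@dataclass(frozen=True)
class Task:
    task_id: str
    workflow_id: str
    status: str


def group_terminal(tasks: Iterable[Task]) -> dict[str, dict[str, list[str]]]:
    """Group terminal tasks by workflow, then by target status set.

    Non-terminal tasks are logged and skipped. A workflow with no terminal
    task gets no entry, so the caller issues no commit for it.
    """
    grouped: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for task in tasks:
        if task.status not in TERMINAL:
            logger.warning(
                "non-terminal task %s (%s) reached terminal persist; skipping",
                task.task_id,
                task.status,
            )
            continue
        grouped[task.workflow_id][task.status.lower()].append(task.task_id)
    return {wf: dict(moves) for wf, moves in grouped.items()}
```

A caller would then issue one commit_transition per returned workflow, passing each status list as the matching keyword argument along with that workflow's schedule snapshot.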

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

The dispatcher and epoch-order stubs accepted **kwargs, so a renamed or typo'd commit_transition param would pass tests while breaking against the real registry. Give them the keyword-only signature so such drift fails loudly. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
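The drift this commit guards against is easy to reproduce. In the sketch below, LooseStub and StrictStub are hypothetical test doubles; records, cancelled, and sched appear in the diff excerpts in this PR, while the other keyword names are assumed from the status list in the description.

```python
from __future__ import annotations

from typing import Any, Sequence


class LooseStub:
    """Swallows any keyword, so a misspelled param goes unnoticed."""

    def commit_transition(self, workflow_id: str, **_: Any) -> None:
        pass


class StrictStub:
    """Keyword-only signature: unknown keywords raise TypeError."""

    def commit_transition(
        self,
        workflow_id: str,
        *,
        records: Sequence[Any] = (),
        dispatched: Sequence[str] = (),
        pending: Sequence[str] = (),
        done: Sequence[str] = (),
        failed: Sequence[str] = (),
        cancelled: Sequence[str] = (),
        sched: Any = None,
    ) -> None:
        pass


def accepts_typo(registry: Any) -> bool:
    """Return True if the registry silently accepts a misspelled keyword."""
    try:
        registry.commit_transition("wf-1", canceled=["t1"])  # typo: one "l"
    except TypeError:
        return False
    return True
```

With the loose stub a test calling commit_transition with canceled= would pass and then fail only against the real registry; the strict stub surfaces the mistake in the test run itself.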

timzsu

Three comments. PTAL

timzsu · 2026-06-14T14:50:04Z

+            self._workflow_registry.commit_transition(
+                workflow_id,
+                records=self._records_locked(*touched),
+                cancelled=cancelled,
+                sched=self._sched_locked(workflow_id),
+            )


Two issues to consider:

1. cancel_workflow does not check for the existence of the workflow. Therefore, if the user passes a wrong ID to the cancel endpoint, we will still commit the transition, which might lead to a partial workflow state. Consider adding an existence check (workflow_exists_async) before it.

2. Interrupts are also sent via Redis, but they are not republished during rehydration. Can we republish the interrupts when rehydrating the cancellation?

kaiitunnz · Jun 14, 2026

1. Added the workflow existence check by checking the existence of task records corresponding to the workflow ID instead of workflow_exists_async to avoid an additional Redis round-trip.

2. This is a gap that should be deferred to a future PR because rehydration runs before workers are re-registered, and the interrupts are delivered with Redis pub-sub, causing them to be dropped. This requires a change to the worker lifecycle and more thorough testing. Added this to the PR description.

timzsu · 2026-06-14T15:02:55Z

+    def commit_transition(
+        self,
+        workflow_id: str,
+        *,
+        records: Sequence[PersistedTask] = (),
+        sched: WorkflowSched | None = None,
+        **_: Any,


Can we match the signature with the runtime? Now it swallows any input which is error-prone.

cancel_workflow committed a schedule snapshot and updated_at bump even when no task matched the id, orphaning workflow keys for an id that never existed and turning the cancel endpoint's intended 404 into a 500 on the partial hash. Return early when the workflow has no in-memory tasks. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
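A minimal sketch of the early return, under simplifying assumptions: MiniRuntime and RecordingRegistry are hypothetical, every non-terminal task is marked CANCELLED directly (the real runtime also has a CANCELLING state for in-flight tasks), and the schedule snapshot is a placeholder string.

```python
from __future__ import annotations

from typing import Any, Sequence

TERMINAL = {"DONE", "FAILED", "CANCELLED"}  # assumed status names


class RecordingRegistry:
    """Hypothetical double that records each commit it receives."""

    def __init__(self) -> None:
        self.commits: list[dict[str, Any]] = []

    def commit_transition(
        self,
        workflow_id: str,
        *,
        records: Sequence[Any] = (),
        cancelled: Sequence[str] = (),
        sched: Any = None,
    ) -> None:
        self.commits.append(
            {
                "workflow_id": workflow_id,
                "records": list(records),
                "cancelled": list(cancelled),
                "sched": sched,
            }
        )


class MiniRuntime:
    def __init__(self, registry: RecordingRegistry) -> None:
        self._registry = registry
        self._tasks: dict[str, dict[str, str]] = {}  # workflow_id -> {task_id: status}

    def add_task(self, workflow_id: str, task_id: str, status: str) -> None:
        self._tasks.setdefault(workflow_id, {})[task_id] = status

    def cancel_workflow(self, workflow_id: str) -> bool:
        tasks = self._tasks.get(workflow_id)
        if not tasks:
            # Unknown id: write nothing, so no workflow keys are orphaned
            # and the caller can answer with "not found".
            return False
        touched = [tid for tid, status in tasks.items() if status not in TERMINAL]
        for tid in touched:
            tasks[tid] = "CANCELLED"
        self._registry.commit_transition(
            workflow_id,
            records=[(tid, tasks[tid]) for tid in touched],
            cancelled=touched,
            sched="snapshot-placeholder",
        )
        return True
```

Checking the in-memory task map avoids the extra Redis round-trip that a workflow_exists_async call would add, which matches the reasoning given in the review reply.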

The rehydrate fake's commit_transition accepted **kwargs, so a renamed or typo'd param would pass tests while breaking against the real registry. Give it the keyword-only signature, matching the dispatcher and epoch-order stubs. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz mentioned this pull request Jun 13, 2026

feat: persist and rehydrate scheduling state across root restarts #64

Merged

8 tasks

Base automatically changed from feat/rolling-node-restart to main June 13, 2026 10:14

kaiitunnz force-pushed the refactor/atomic-workflow-transitions branch from e3261ef to 8f848c1 Compare June 13, 2026 12:52

kaiitunnz added 2 commits June 13, 2026 21:11

kaiitunnz force-pushed the refactor/atomic-workflow-transitions branch from 8f848c1 to f577fda Compare June 13, 2026 13:12

kaiitunnz added 3 commits June 13, 2026 21:44

refactor: minor refactor

007bb3f

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz marked this pull request as ready for review June 13, 2026 15:06

kaiitunnz requested a review from timzsu June 13, 2026 15:06

timzsu requested changes Jun 14, 2026


kaiitunnz added 2 commits June 15, 2026 01:03

kaiitunnz requested a review from timzsu June 14, 2026 17:30

kaiitunnz wants to merge 7 commits into main from refactor/atomic-workflow-transitions


2 participants
