refactor: commit workflow-registry transitions atomically#72

Open
kaiitunnz wants to merge 7 commits into main from refactor/atomic-workflow-transitions

@kaiitunnz kaiitunnz commented Jun 13, 2026

Purpose

TaskRuntime persisted each task transition as several independent Redis writes — the task records via save_task_states, the workflow status-set membership via a per-status mark_task_* verb, and the schedule snapshot via save_workflow_sched — each opening its own transaction. A crash between any two left durable state half-applied (e.g. a task in the cancelled set whose persisted record still says PENDING, so rehydrate re-runs it). The API-driven workflow cancel was the sharpest case: unlike the event-driven transitions, it has no durable event to replay, so nothing heals a partial write. This was raised in review on #64.

Changes

  • Atomic primitive (src/server/registries/workflow.py): add WorkflowRegistry.commit_transition, which takes a transition delta as explicit keyword params (record upserts; per-status membership moves for dispatched / pending / done / failed / cancelled; and an optional WorkflowSched) and applies the whole delta in a single control_pipeline() MULTI/EXEC. Remove the per-status verbs (mark_task_dispatched/pending/done/failed/cancelled, sync + async) and the sync save_task_states / save_workflow_sched they were paired with.
  • Runtime migration (src/server/task/runtime.py): every transition (dispatch, requeue/mark_pending, terminal mark_succeeded / mark_failed / mark_cancelled, cancel_workflow, and the replay re-persist) now commits one commit_transition from its already-applied in-memory state as the single last step. _persist_terminal_locked groups terminal tasks by workflow, folds the schedule snapshot in, warns on (and skips) any non-terminal task, and skips the commit entirely when a workflow has no terminal task to move; cancel_workflow commits cancelling records + cancelled membership + schedule in one transaction.
  • Tests: the three registry doubles move to a single commit_transition; the five persist-failure tests now fault-inject on that one choke point; two new tests assert the cancel guarantee directly (test_cancel_workflow_commits_atomically_on_crash, test_rehydrate_restores_cancelled_workflow).
  • Docs (docs/SERVICE_RESTARTS.md): document the atomic-transition contract and why cancel relies on it alone.
  • Dead-code prune (separate commit): remove the long-unused update_workflow(_async), delete_task_states(_async), and sync load_workflow_sched from the registry.
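The shape of the new primitive can be sketched roughly as follows. This is an illustration only, not the code in this PR: FakePipeline, the key names (wf:<id>:tasks and so on), the use of plain strings for records and the schedule, and the rule "remove from every other status set" are all assumptions chosen so the sketch runs without Redis. Only the name commit_transition, the keyword-only delta, and the five status groups come from the description above.

```python
STATUSES = ("dispatched", "pending", "done", "failed", "cancelled")


class FakePipeline:
    """Hypothetical stand-in for a MULTI/EXEC pipeline: commands are only
    buffered, and nothing touches the store until execute() runs."""

    def __init__(self, store):
        self._store = store
        self._queued = []

    def hset(self, key, field, value):
        self._queued.append(("hset", key, field, value))

    def sadd(self, key, member):
        self._queued.append(("sadd", key, member))

    def srem(self, key, member):
        self._queued.append(("srem", key, member))

    def set(self, key, value):
        self._queued.append(("set", key, value))

    def execute(self):
        # Apply every buffered command in order, then clear the buffer.
        for op, key, *args in self._queued:
            if op == "hset":
                self._store.setdefault(key, {})[args[0]] = args[1]
            elif op == "sadd":
                self._store.setdefault(key, set()).add(args[0])
            elif op == "srem":
                self._store.get(key, set()).discard(args[0])
            else:
                self._store[key] = args[0]
        self._queued.clear()


def commit_transition(pipe, workflow_id, *, records=None, dispatched=(),
                      pending=(), done=(), failed=(), cancelled=(), sched=None):
    """Queue the whole delta on one pipeline and execute it once."""
    moves = dict(zip(STATUSES, (dispatched, pending, done, failed, cancelled)))
    for task_id, record in (records or {}).items():
        pipe.hset(f"wf:{workflow_id}:tasks", task_id, record)
    for status, task_ids in moves.items():
        for task_id in task_ids:
            # Assumed membership rule: a task lives in exactly one status set.
            for other in STATUSES:
                if other != status:
                    pipe.srem(f"wf:{workflow_id}:{other}", task_id)
            pipe.sadd(f"wf:{workflow_id}:{status}", task_id)
    if sched is not None:
        pipe.set(f"wf:{workflow_id}:sched", sched)
    pipe.execute()


store = {"wf:wf1:pending": {"t1"}}
commit_transition(FakePipeline(store), "wf1",
                  records={"t1": "CANCELLED"}, cancelled=["t1"],
                  sched="snapshot-v2")
```

Because every write is queued before the single execute(), a failure at any earlier point leaves the store exactly as it was, which is the property the per-status verbs could not offer.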

Design

The durable shape of every transition is the same — upsert some records, move some tasks between status sets, maybe snapshot the schedule — so commit_transition takes that delta as explicit params and commits it atomically, rather than scattering it across per-status verbs or a one-off persist_cancellation helper. All affected keys live on the control Redis, so a single MULTI/EXEC covers them: the transition commits in full or not at all, and a crash mid-persist can never leave a half-applied state. Event-driven transitions keep their at-least-once replay backstop; the cancel path, which has none, now relies on this atomicity. The "persist last" invariant is preserved — in-memory mutations all happen before the single commit — so a failed write still leaves in-memory fully applied and replay heals as before. Behavior is otherwise unchanged: each status group reproduces the exact srem/sadd of the old verb, mark_cancelled still writes no schedule, and updated_at bumps on the same transitions as before (plus the schedule-only commits, where the durable state did change).
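The "persist last" invariant and its interaction with a failed write can be illustrated with a small fault-injection sketch. Everything here is hypothetical: FakeRegistry, RuntimeSketch, the fail_next switch, and the dict-based state are stand-ins, and the real runtime and registry are certainly richer. The sketch only shows the ordering described above: mutate in-memory state first, commit once at the end, and let a replay of the same event heal a failed commit.

```python
class CommitFailed(Exception):
    """Simulated failure of the single durable write."""


class FakeRegistry:
    """Hypothetical registry double; `durable` stands in for control Redis."""

    def __init__(self):
        self.durable = {"records": {"t1": "PENDING"}, "failed": set()}
        self.fail_next = False

    def commit_transition(self, workflow_id, *, records, failed):
        if self.fail_next:
            self.fail_next = False
            raise CommitFailed(workflow_id)
        # Build the new state off to the side and swap it in at once,
        # mimicking a transaction that applies in full or not at all.
        new_records = {**self.durable["records"], **records}
        new_failed = self.durable["failed"] | set(failed)
        self.durable = {"records": new_records, "failed": new_failed}


class RuntimeSketch:
    def __init__(self, registry):
        self.registry = registry
        self.memory = {"t1": "PENDING"}

    def mark_failed(self, workflow_id, task_id):
        # 1. Apply the in-memory mutation first.
        self.memory[task_id] = "FAILED"
        # 2. Persist last, as one commit built from the in-memory state.
        self.registry.commit_transition(
            workflow_id,
            records={task_id: self.memory[task_id]},
            failed=[task_id],
        )


registry = FakeRegistry()
runtime = RuntimeSketch(registry)

registry.fail_next = True
try:
    runtime.mark_failed("wf1", "t1")
except CommitFailed:
    pass  # durable state is untouched; in-memory state is fully applied

durable_after_failure = dict(registry.durable)

# Replaying the same event re-runs the transition and the commit succeeds.
runtime.mark_failed("wf1", "t1")
```

For event-driven transitions the second call models the at-least-once replay. The API-driven cancel has no such second call, which is why it depends on the commit itself being all-or-nothing.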

Test Plan

uv run pytest tests/server/
uv run pre-commit run --files src/server/registries/workflow.py src/server/task/runtime.py tests/server/task/test_runtime_rehydrate.py tests/server/task/test_runtime_epoch_order.py tests/server/dispatcher/helpers.py docs/SERVICE_RESTARTS.md

End-to-end against a single root + CPU worker on freshly built images carrying this branch (5 scenarios / 22 assertions). It targets what the unit suite cannot reach: the API-driven workflow cancel path, which has no event to replay and so relies on the single atomic commit alone, and the workflow status-set membership that derives the reported DONE / CANCELLED status. Scenarios: API cancel of (1) a queued, (2) an in-flight SSH sleep, and (3) a mixed PENDING+DISPATCHED workflow — each asserted to commit the whole cascade in one transaction, rehydrate CANCELLED across a flowmesh stack restart server, and never resurrect a task once a worker comes up; (4) completion deriving DONE idempotently across a restart; (5) the _persist_terminal_locked guard logging zero non-terminal warnings.

Test Result

$ uv run pytest tests/server/        # 398 passed
$ uv run pre-commit run --files ...  # isort / black / ruff / mypy / codespell / gitleaks passed

End-to-end suite: 22 passed, 0 failed (exit 0). The API-driven cancel persisted its full cascade and survived a server restart with no task resurrected in all three cancel scenarios (queued, in-flight, mixed-status); completion stayed DONE across a restart; and the terminal-persist guard logged 0 non-terminal warnings.

Follow-up

Cancel interrupts are delivered over Redis pub/sub, which has no replay. If the root restarts mid-cancel, rehydration restores the task's CANCELLING state but never re-sends the interrupt, so the cancellation can be silently dropped. The fix belongs in the worker lifecycle (re-deliver on re-register) and needs its own testing, so it is deferred to a future PR.


Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run and fixed any issues.
  • I have added or updated tests covering my changes.
  • I have verified that uv run pytest tests/server/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work.
  • If this is a breaking change, I have prefixed the PR title with [BREAKING].
  • I have updated documentation if user-facing behavior changed.

Base automatically changed from feat/rolling-node-restart to main June 13, 2026 10:14
@kaiitunnz kaiitunnz force-pushed the refactor/atomic-workflow-transitions branch from e3261ef to 8f848c1 Compare June 13, 2026 12:52
Drop update_workflow / update_workflow_async, delete_task_states /
delete_task_states_async, and the sync load_workflow_sched — none had any
caller. The rehydrate path uses load_workflow_sched_async, which stays.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Every TaskRuntime transition persisted its task records and its workflow
status-set membership as separate Redis transactions, so a crash mid-persist
could leave durable state half-applied. The API-driven workflow cancel was the
sharpest case: unlike event-driven transitions, it has no replay to heal a
partial write.

Introduce WorkflowTransition and WorkflowRegistry.commit_transition, which
applies a transition's record upserts, status-set membership moves, and
optional schedule snapshot in one atomic control-Redis transaction. Each
transition (dispatch, requeue, terminal, cancel, and replay re-persist) now
builds one delta and commits it as the single last step, replacing the
per-status mark_task_* verbs and the separate record / schedule writes.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the refactor/atomic-workflow-transitions branch from 8f848c1 to f577fda Compare June 13, 2026 13:12
WorkflowTransition added a DTO between every caller and the registry while
carrying no behavior of its own. Replace it with keyword-only params on
commit_transition so each call site reads as a direct delta. While there,
_persist_terminal_locked now collects only terminal tasks, warns when a
non-terminal task reaches the path, and skips the commit for a workflow with
no terminal move.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
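The grouping behaviour this commit describes for _persist_terminal_locked can be sketched as below. The Task dataclass, the status names (FAILED in particular), the logger name, and the helper group_terminal_by_workflow are assumptions for illustration; the actual method also folds in the schedule snapshot and issues the commit, which this sketch leaves out.

```python
import logging
from collections import defaultdict
from dataclasses import dataclass

logger = logging.getLogger("runtime_sketch")  # hypothetical logger name

# Assumed set of terminal statuses.
TERMINAL = ("DONE", "FAILED", "CANCELLED")


@dataclass(frozen=True)
class Task:
    task_id: str
    workflow_id: str
    status: str


def group_terminal_by_workflow(tasks):
    """Return {workflow_id: {status: [task_id, ...]}} for terminal tasks only.

    Non-terminal tasks are logged and skipped. A workflow whose tasks are all
    non-terminal gets no entry, so the caller has nothing to commit for it.
    """
    grouped = defaultdict(lambda: {status: [] for status in TERMINAL})
    for task in tasks:
        if task.status not in TERMINAL:
            logger.warning(
                "non-terminal task %s (%s) reached terminal persist; skipping",
                task.task_id,
                task.status,
            )
            continue
        grouped[task.workflow_id][task.status].append(task.task_id)
    return dict(grouped)


batch = [
    Task("a", "wf1", "DONE"),
    Task("b", "wf1", "PENDING"),      # warned on and skipped
    Task("c", "wf2", "DISPATCHED"),   # wf2 ends up with no entry at all
]
by_workflow = group_terminal_by_workflow(batch)
```

A caller would then issue one commit per key in by_workflow, so wf2 in this example triggers no write.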
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The dispatcher and epoch-order stubs accepted **kwargs, so a renamed or
typo'd commit_transition param would pass tests while breaking against the
real registry. Give them the keyword-only signature so such drift fails loudly.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
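The difference between the two kinds of stub is easy to demonstrate. In this sketch LooseStub and StrictStub are hypothetical test doubles; the parameter names follow the status groups listed in the PR description, but the exact signature of the real registry method is an assumption.

```python
class LooseStub:
    """Accepts anything, so a misspelled keyword goes unnoticed."""

    def __init__(self):
        self.calls = []

    def commit_transition(self, workflow_id, **kwargs):
        self.calls.append((workflow_id, kwargs))


class StrictStub:
    """Keyword-only signature: unknown keywords raise TypeError."""

    def __init__(self):
        self.calls = []

    def commit_transition(self, workflow_id, *, records=(), dispatched=(),
                          pending=(), done=(), failed=(), cancelled=(),
                          sched=None):
        self.calls.append((workflow_id, {"cancelled": list(cancelled)}))


loose = LooseStub()
loose.commit_transition("wf1", canceled=["t1"])  # typo is silently accepted

strict = StrictStub()
try:
    strict.commit_transition("wf1", canceled=["t1"])  # same typo
    typo_detected = False
except TypeError:
    typo_detected = True
```

With the strict double, a renamed or misspelled parameter fails in the test suite instead of in production against the real registry.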
@kaiitunnz kaiitunnz marked this pull request as ready for review June 13, 2026 15:06
@kaiitunnz kaiitunnz requested a review from timzsu June 13, 2026 15:06

@timzsu timzsu left a comment

Three comments. PTAL

Comment on lines +1216 to +1221
self._workflow_registry.commit_transition(
    workflow_id,
    records=self._records_locked(*touched),
    cancelled=cancelled,
    sched=self._sched_locked(workflow_id),
)

Two issues to consider:

  1. cancel_workflow does not check for the existence of the workflow. Therefore, if the user passes a wrong ID to the cancel endpoint, we will still commit the transition, which might lead to a partial workflow state. Consider adding an existence check (workflow_exists_async) before it.
  2. Interrupts are also sent via Redis, but they are not republished during rehydration. Can we republish the interrupts when rehydrating the cancellation?

@kaiitunnz kaiitunnz Jun 14, 2026

  1. Added the existence check. It looks for task records belonging to the workflow ID rather than calling workflow_exists_async, which avoids an additional Redis round-trip.
  2. This is a real gap, but it should be deferred to a future PR: rehydration runs before workers re-register, and interrupts are delivered over Redis pub/sub, so a republished interrupt would be dropped. Fixing it requires a change to the worker lifecycle and more thorough testing. Added a note to the PR description.

Comment on lines +78 to +84
def commit_transition(
    self,
    workflow_id: str,
    *,
    records: Sequence[PersistedTask] = (),
    sched: WorkflowSched | None = None,
    **_: Any,

Can we match the signature with the runtime? As written, it swallows any input, which is error-prone.

Collaborator Author

Fixed.

cancel_workflow committed a schedule snapshot and updated_at bump even when
no task matched the id, orphaning workflow keys for an id that never existed
and turning the cancel endpoint's intended 404 into a 500 on the partial
hash. Return early when the workflow has no in-memory tasks.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
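The early return can be sketched as follows. RecordingRegistry, RuntimeSketch, the boolean return value, the status names, and the placeholder schedule string are all assumptions made for illustration; the sketch only shows the guard ordering, where the in-memory lookup happens before any durable write.

```python
class RecordingRegistry:
    """Hypothetical registry double that records which workflows were committed."""

    def __init__(self):
        self.commits = []

    def commit_transition(self, workflow_id, *, records=(), cancelled=(),
                          sched=None):
        self.commits.append(workflow_id)


class RuntimeSketch:
    def __init__(self, registry, tasks_by_workflow):
        self._registry = registry
        # workflow_id -> {task_id: status}; stands in for in-memory state.
        self._tasks = tasks_by_workflow

    def cancel_workflow(self, workflow_id):
        tasks = self._tasks.get(workflow_id)
        if not tasks:
            # Unknown id: write nothing, so no workflow keys are created for
            # it. The caller can map this to a "not found" response.
            return False
        touched = [t for t, s in tasks.items()
                   if s in ("PENDING", "DISPATCHED")]
        for task_id in touched:
            tasks[task_id] = "CANCELLED"
        self._registry.commit_transition(
            workflow_id,
            records={t: tasks[t] for t in touched},
            cancelled=touched,
            sched="schedule-snapshot",  # placeholder for the real snapshot
        )
        return True


registry = RecordingRegistry()
runtime = RuntimeSketch(registry, {"wf1": {"t1": "PENDING", "t2": "DONE"}})

missing_result = runtime.cancel_workflow("no-such-workflow")
known_result = runtime.cancel_workflow("wf1")
```

Checking in-memory tasks instead of querying the registry keeps the guard free of an extra round-trip, which matches the reasoning given in the review reply.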
The rehydrate fake's commit_transition accepted **kwargs, so a renamed or
typo'd param would pass tests while breaking against the real registry. Give
it the keyword-only signature, matching the dispatcher and epoch-order stubs.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz requested a review from timzsu June 14, 2026 17:30

2 participants