fix(miles): track pause/continue refs for timeout cancellation#23

Open
TianyeGGBond wants to merge 1 commit into rlops:zhenyu/miles-mvp-e2e from TianyeGGBond:tianye/m11-track-pause-continue-refs

@TianyeGGBond

Collaborator

Context

MilesModelUpdateService._run_atomic_unit appends every Ray ObjectRef it
issues to inflight_refs, so that an asyncio.wait_for timeout (or an outer
cancellation) can fan out ray.cancel(force=True) on each one. Without that
fan-out, the local coroutine is cancelled but the remote actor methods keep
running and keep holding the cache-owner lock, which blocks the next sync.

The pause_generation / continue_generation RPCs that bracket the
finalize_weight_update fan-out were the only refs in the unit that were not
recorded. A timeout landing in that window leaves those RPCs running on the
engines, which is exactly the leak inflight_refs exists to prevent.

Change

  • pipeline/miles_model_update_service.py
    • inflight_refs.extend(pause_refs) right after issuing the pre-finalize
      pause_generation calls.
    • inflight_refs.extend(cont_refs) right after issuing the post-finalize
      continue_generation calls.

Two lines; no behavior change on the success path, only correct cleanup on
timeout/cancel.
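
For reviewers who want to see the failure mode in isolation, here is a minimal stdlib-only sketch of the pattern. It is not the code in miles_model_update_service.py and does not use Ray: asyncio tasks stand in for Ray ObjectRefs, task.cancel() stands in for the ray.cancel(force=True) fan-out, and fake_rpc, issue, sync_with_timeout, and the track_pause_continue flag are hypothetical names made up for illustration. asyncio.wait is used because, like awaiting a remote ref, cancelling the waiter does not cancel the work being waited on.

```python
import asyncio


async def fake_rpc(name, delay, log):
    """Hypothetical stand-in for a remote actor method; logs how it ended."""
    try:
        await asyncio.sleep(delay)
        log.append(f"{name}: finished")
    except asyncio.CancelledError:
        log.append(f"{name}: cancelled")
        raise


def issue(name, delay, log):
    """Start a fake remote call; the returned task plays the role of an ObjectRef."""
    return asyncio.create_task(fake_rpc(name, delay, log))


async def run_atomic_unit(inflight_refs, log, track_pause_continue):
    # Pre-finalize pause. Before the fix these refs were not recorded.
    pause_refs = [issue("pause_generation", 0.2, log)]
    if track_pause_continue:
        inflight_refs.extend(pause_refs)
    # asyncio.wait leaves the tasks running if this coroutine is cancelled.
    await asyncio.wait(pause_refs)

    finalize_refs = [issue("finalize_weight_update", 0.01, log)]
    inflight_refs.extend(finalize_refs)
    await asyncio.wait(finalize_refs)

    # Post-finalize continue. Also unrecorded before the fix.
    cont_refs = [issue("continue_generation", 0.01, log)]
    if track_pause_continue:
        inflight_refs.extend(cont_refs)
    await asyncio.wait(cont_refs)


async def sync_with_timeout(track_pause_continue, timeout=0.05):
    inflight_refs, log = [], []
    try:
        await asyncio.wait_for(
            run_atomic_unit(inflight_refs, log, track_pause_continue), timeout
        )
    except asyncio.TimeoutError:
        # Fan out cancellation to every recorded ref.
        for ref in inflight_refs:
            ref.cancel()
    # Give any unrecorded (leaked) call time to run to completion.
    await asyncio.sleep(0.3)
    return log
```

In this sketch the timeout lands while the pause call is still in flight. With track_pause_continue=False the call is never cancelled and runs to completion after the timeout; with track_pause_continue=True it is cancelled along with everything else in inflight_refs.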

The atomic sync unit records every Ray ObjectRef it issues so a wait_for
timeout can fan out ray.cancel(force=True). The pause_generation and
continue_generation refs around the finalize fan-out were not recorded,
so a timeout landing in that window left those RPCs running on the engines.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>