fix(miles): track pause/continue refs for timeout cancellation #23
Open
TianyeGGBond wants to merge 1 commit into
The atomic sync unit records every Ray ObjectRef it issues so a wait_for timeout can fan out ray.cancel(force=True). The pause_generation and continue_generation refs around the finalize fan-out were not recorded, so a timeout landing in that window left those RPCs running on the engines.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Context
MilesModelUpdateService._run_atomic_unit records every Ray ObjectRef it issues into inflight_refs, so that an asyncio.wait_for timeout (or an outer cancellation) can fan out ray.cancel(force=True) on each one. Otherwise the local coroutine cancels but the remote actor methods keep running and keep holding the cache-owner lock, blocking the next sync.
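The record-then-cancel pattern described above can be sketched with plain asyncio. This is a minimal illustration, not the service's actual code: run_with_cancel_fanout, slow_unit, demo, and the cancel_ref callback are hypothetical names, and cancel_ref stands in for the ray.cancel(force=True) call so the sketch runs without a Ray cluster.

```python
import asyncio


async def run_with_cancel_fanout(unit, timeout_s, cancel_ref):
    """Run unit(inflight_refs) under a timeout and cancel every recorded ref on failure.

    Hypothetical helper: `cancel_ref` stands in for ray.cancel(ref, force=True).
    """
    inflight_refs = []
    try:
        return await asyncio.wait_for(unit(inflight_refs), timeout=timeout_s)
    except (asyncio.TimeoutError, asyncio.CancelledError):
        # Only the local coroutine has been cancelled at this point. Remote work
        # keeps running unless each recorded ref is cancelled explicitly.
        for ref in inflight_refs:
            cancel_ref(ref)
        raise


async def slow_unit(inflight_refs):
    # Record the ref immediately after issuing it, before awaiting anything.
    inflight_refs.append("ref-1")
    await asyncio.sleep(10)  # simulates a remote call that outlives the timeout


def demo():
    cancelled = []

    async def main():
        try:
            await run_with_cancel_fanout(slow_unit, 0.01, cancelled.append)
        except asyncio.TimeoutError:
            pass

    asyncio.run(main())
    return cancelled
```

Only refs that were appended to inflight_refs before the timeout fired are reachable by the cleanup loop, which is why every issued ref has to be recorded.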
The pause_generation / continue_generation RPCs that bracket the finalize_weight_update fan-out were the only refs in the unit not recorded. A timeout landing in that window leaves those RPCs running on the engines, which is the exact leak inflight_refs exists to prevent.
Change
pipeline/miles_model_update_service.py
- inflight_refs.extend(pause_refs) right after issuing the pre-finalize pause_generation calls.
- inflight_refs.extend(cont_refs) right after issuing the post-finalize continue_generation calls.

Two lines; no behavior change on the success path, only correct cleanup on timeout/cancel.
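To make the effect of the two added lines concrete, here is a simplified model of the pause / finalize / continue bracket. It is a sketch under stated assumptions rather than the code in this PR: atomic_unit, issue, and leaked_after_timeout are hypothetical names, refs are plain strings, and the record_bracket_refs flag toggles between the old and new behavior.

```python
import asyncio


async def atomic_unit(inflight_refs, issue, record_bracket_refs):
    """Simplified stand-in for the pause -> finalize -> continue sequence."""
    pause_refs = [issue("pause_generation")]
    if record_bracket_refs:
        inflight_refs.extend(pause_refs)  # first added line in this PR

    finalize_refs = [issue("finalize_weight_update")]
    inflight_refs.extend(finalize_refs)  # was already recorded before this PR
    await asyncio.sleep(10)  # the timeout lands here in this sketch

    cont_refs = [issue("continue_generation")]
    if record_bracket_refs:
        inflight_refs.extend(cont_refs)  # second added line in this PR


def leaked_after_timeout(record_bracket_refs):
    """Return the issued refs that the timeout cleanup could not reach."""
    issued, cancelled = [], []

    def issue(name):
        issued.append(name)
        return name

    async def main():
        inflight_refs = []
        try:
            await asyncio.wait_for(
                atomic_unit(inflight_refs, issue, record_bracket_refs),
                timeout=0.01,
            )
        except asyncio.TimeoutError:
            # Stand-in for fanning out ray.cancel(force=True) over inflight_refs.
            cancelled.extend(inflight_refs)

    asyncio.run(main())
    return [ref for ref in issued if ref not in cancelled]
```

With record_bracket_refs=False the pause_generation ref is issued but never reaches the cleanup path; with record_bracket_refs=True nothing is left behind.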