Warn when zero.Init silently falls back to a single rank (#8084) #8089
Open
akshansh47 wants to merge 1 commit into
…#8084)

When a multi-process launcher sets WORLD_SIZE>1 but the distributed process group is not initialized before zero.Init runs (e.g. from_pretrained before deepspeed.init_distributed()), the resolved group collapses to a single rank. zero.Init then materializes every parameter whole on every rank instead of partitioning, so each rank loads the full model and OOMs with no diagnostic.

Detect this case and emit an actionable warning pointing at the missing init_distributed() call.

Co-authored-by: Cursor <cursoragent@cursor.com>
Problem
Fixes #8084.
`deepspeed.zero.Init` resolves its partition group from `dist.get_world_group()`. If the distributed process group has not been initialized before `zero.Init` runs (the classic case: `AutoModel.from_pretrained(...)` under an active `HfDeepSpeedConfig` executes before `deepspeed.init_distributed()`), the call at the top of `Init.__init__` ends up with a process group that only sees the local rank, so `self.dp_world_size == 1`.

`zero.Init` then materializes every parameter whole on every rank instead of partitioning it. Under `deepspeed --num_gpus N`, every rank loads the full (unsharded) model and OOMs. The failure is silent and indistinguishable from an honest "model too big" OOM, which makes it very hard to diagnose.

Fix
Detect the case and emit a loud, actionable warning: the launcher reports `WORLD_SIZE > 1` but the resolved group collapsed to a single rank. The warning names the likely cause and the fix (call `deepspeed.init_distributed()` before building the model under `zero.Init`).

This is intentionally a warning, not an error or an auto-fix:

- A caller-supplied `data_parallel_group` is treated as intentional and never warns.
- A single-process run (`WORLD_SIZE` unset or `1`) never warns.

Detection is factored into a small pure helper `_unsharded_single_rank_warning(...)` so it is unit-testable without a GPU or a live process group.

Tests
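`tests/unit/runtime/zero/test_zero_init_unsharded_warning.py` covers:

- a warning when `WORLD_SIZE>1` but the group is single-rank,
- no warning for a single-process run (`WORLD_SIZE` unset / `1`),
- no warning when the group is already sharded (`dp_world_size>1`),
- no warning when `data_parallel_group` is supplied,
- malformed `WORLD_SIZE` values do not raise.

Made with Cursor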
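For readers who want a concrete picture of the detection logic, here is a minimal sketch of what a pure check of this kind could look like. It is not the PR's actual helper: the function name `unsharded_single_rank_warning_sketch`, its parameters (`dp_world_size`, `env_world_size`, `group_was_supplied`), and the message text are all assumptions made for illustration. The real `_unsharded_single_rank_warning(...)` signature is not shown in this description.

```python
from typing import Optional


def unsharded_single_rank_warning_sketch(
    dp_world_size: int,
    env_world_size: Optional[str],
    group_was_supplied: bool,
) -> Optional[str]:
    """Return a warning message, or None when nothing looks wrong.

    Hypothetical stand-in: parameter names and message text are assumptions,
    not the helper added by the PR.
    """
    # A group passed in by the caller is treated as intentional.
    if group_was_supplied:
        return None

    # The resolved group spans several ranks, so partitioning will happen.
    if dp_world_size > 1:
        return None

    # WORLD_SIZE unset counts as a single-process run; a value that cannot
    # be parsed as an integer is ignored rather than raising.
    try:
        launcher_world_size = 1 if env_world_size is None else int(env_world_size)
    except (TypeError, ValueError):
        return None

    if launcher_world_size <= 1:
        return None

    return (
        f"WORLD_SIZE={launcher_world_size} but the zero.Init partition group "
        "has a single rank, so parameters will not be partitioned. "
        "Call deepspeed.init_distributed() before building the model under zero.Init."
    )


if __name__ == "__main__":
    import os

    # Pure function: no GPU or live process group is needed to call it.
    message = unsharded_single_rank_warning_sketch(
        dp_world_size=1,
        env_world_size=os.environ.get("WORLD_SIZE"),
        group_was_supplied=False,
    )
    if message is not None:
        print(message)
```

Because the sketch takes plain values rather than reading the process group itself, each case in the list above reduces to one call with fixed arguments, which is the property that makes this style of helper easy to unit test.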