Warn when zero.Init silently falls back to a single rank (#8084) #8089

Open
akshansh47 wants to merge 1 commit into deepspeedai:master from akshansh47:fix/zero-init-unsharded-single-rank-warning

@akshansh47

Problem

Fixes #8084.

deepspeed.zero.Init resolves its partition group from dist.get_world_group(). If the distributed process group has not been initialized before zero.Init runs (the classic case: AutoModel.from_pretrained(...) under an active HfDeepSpeedConfig executes before deepspeed.init_distributed()), the call at the top of Init.__init__:

if not dist.is_initialized():
    init_distributed()

ends up with a process group that sees only the local rank, so self.dp_world_size == 1. zero.Init then materializes every parameter in full on every rank instead of partitioning it. Under deepspeed --num_gpus N, every rank loads the full (unsharded) model and runs out of memory. The failure is silent and indistinguishable from a genuine "model too big" OOM, which makes it very hard to diagnose.
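For illustration, here is a minimal sketch of the call order that avoids the collapse, with deepspeed.init_distributed() running before the model is built. The function name load_model_sharded and its parameters are hypothetical. The import path transformers.integrations.HfDeepSpeedConfig assumes a reasonably recent transformers release, and ds_config is assumed to be a ZeRO stage 3 config.

```python
def load_model_sharded(model_name: str, ds_config: dict):
    """Hypothetical helper: set up the process group, then load the model."""
    # Imports are local so this sketch can be defined without a GPU stack.
    import deepspeed
    from transformers import AutoModel
    from transformers.integrations import HfDeepSpeedConfig

    # 1. Initialize the distributed process group first, so that zero.Init
    #    resolves a group containing every launched rank.
    deepspeed.init_distributed()

    # 2. Activate the DeepSpeed config for transformers. The object must stay
    #    referenced while from_pretrained runs, so it is returned to the caller.
    hf_ds_config = HfDeepSpeedConfig(ds_config)

    # 3. Only now build the model; from_pretrained enters zero.Init here.
    model = AutoModel.from_pretrained(model_name)
    return model, hf_ds_config
```

The sketch is meant to run under a multi-process launcher such as deepspeed --num_gpus N. It shows only the ordering and is not code from this PR.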

Fix

Detect the case and emit a loud, actionable warning: the launcher reports WORLD_SIZE > 1 but the resolved group collapsed to a single rank. The warning names the likely cause and the fix (call deepspeed.init_distributed() before building the model under zero.Init).

This is intentionally a warning, not an error or an auto-fix:

  • An explicitly supplied size-1 data_parallel_group is treated as intentional and never warns.
  • A genuine single-process run (WORLD_SIZE unset or 1) never warns.
  • It does not change any partitioning behavior, so it cannot break a working setup.

Detection is factored into a small pure helper _unsharded_single_rank_warning(...) so it is unit-testable without a GPU or a live process group.
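As a rough sketch of what such a pure helper could look like (the signature, parameter names, and message text below are assumptions for illustration, not the code in this PR):

```python
from typing import Optional


def _parse_world_size(raw: Optional[str]) -> int:
    """Return the launcher-reported world size; treat unset or malformed as 1."""
    if raw is None:
        return 1
    try:
        return int(raw)
    except ValueError:
        # Malformed WORLD_SIZE must never raise from a diagnostic path.
        return 1


def unsharded_single_rank_warning(dp_world_size: int,
                                  env_world_size: Optional[str],
                                  explicit_group: bool) -> Optional[str]:
    """Return a warning message, or None when no warning is warranted.

    Hypothetical stand-in for _unsharded_single_rank_warning(...).
    """
    # An explicitly supplied group is treated as intentional, even at size 1.
    if explicit_group:
        return None
    # The group really does shard across ranks: nothing to report.
    if dp_world_size > 1:
        return None
    # Genuine single-process run (WORLD_SIZE unset, 1, or unparseable).
    launcher_world_size = _parse_world_size(env_world_size)
    if launcher_world_size <= 1:
        return None
    return (f"zero.Init resolved a single-rank group although "
            f"WORLD_SIZE={launcher_world_size}; parameters will not be "
            f"partitioned. Call deepspeed.init_distributed() before building "
            f"the model under zero.Init.")
```

A caller would pass in os.environ.get("WORLD_SIZE") and log the returned message if it is not None. Because the helper takes plain values instead of reading global state, it can be tested without a process group.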

Tests

tests/unit/runtime/zero/test_zero_init_unsharded_warning.py covers:

  • warns when launcher WORLD_SIZE>1 but group is single-rank,
  • no warning for a genuine single-process run (WORLD_SIZE unset / 1),
  • no warning when the group actually shards (dp_world_size>1),
  • no warning when an explicit data_parallel_group is supplied,
  • malformed WORLD_SIZE values do not raise.

Made with Cursor

…#8084)

When a multi-process launcher sets WORLD_SIZE>1 but the distributed
process group is not initialized before zero.Init runs (e.g.
from_pretrained before deepspeed.init_distributed()), the resolved group
collapses to a single rank. zero.Init then materializes every parameter
whole on every rank instead of partitioning, so each rank loads the full
model and OOMs with no diagnostic. Detect this case and emit an
actionable warning pointing at the missing init_distributed() call.

Co-authored-by: Cursor <cursoragent@cursor.com>

Development

Successfully merging this pull request may close these issues.

zero.Init silently does not shard (world_size=1) when the process group is uninitialized before from_pretrained -> per-rank full load -> OOM