Warn when zero.Init silently falls back to a single rank (#8084) by akshansh47 · Pull Request #8089 · deepspeedai/DeepSpeed

akshansh47 · 2026-06-24T18:59:42Z

Problem

deepspeed.zero.Init resolves its partition group from dist.get_world_group(). The classic failure case is AutoModel.from_pretrained(...) under an active HfDeepSpeedConfig executing before deepspeed.init_distributed(). When the distributed process group has not been initialized before zero.Init runs, the call at the top of Init.__init__:

if not dist.is_initialized():
    init_distributed()

ends up with a process group that only sees the local rank, so self.dp_world_size == 1. zero.Init then materializes every parameter whole on every rank instead of partitioning it. Under deepspeed --num_gpus N, every rank therefore loads the full, unsharded model and runs out of memory. The failure is silent and indistinguishable from a genuine "model too big" OOM, which makes it very hard to diagnose.

Fix

Detect the case and emit a loud, actionable warning: the launcher reports WORLD_SIZE > 1 but the resolved group collapsed to a single rank. The warning names the likely cause and the fix (call deepspeed.init_distributed() before building the model under zero.Init).
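For readers hitting the warning, the call ordering it recommends can be sketched as below. This is an illustration, not code from this PR: load_model_sharded is a hypothetical wrapper, the import path for HfDeepSpeedConfig may differ between Transformers versions, and ds_config is assumed to be a ZeRO stage 3 config.

```python
def load_model_sharded(model_name: str, ds_config: dict):
    """Hypothetical wrapper: set up the process group, then build the model."""
    # Imports are local so this module can be imported on a machine that
    # does not have DeepSpeed or Transformers installed.
    import deepspeed
    from transformers import AutoModel
    from transformers.integrations import HfDeepSpeedConfig

    # 1. Create the distributed process group first, so that zero.Init
    #    resolves a group spanning all launched ranks.
    deepspeed.init_distributed()

    # 2. Activate the DeepSpeed config for Transformers. The object must stay
    #    referenced while from_pretrained runs, so it is returned to the caller.
    hf_ds_config = HfDeepSpeedConfig(ds_config)

    # 3. Only now build the model; with a ZeRO stage 3 config this is where
    #    from_pretrained constructs parameters under zero.Init.
    model = AutoModel.from_pretrained(model_name)
    return model, hf_ds_config
```

Swapping steps 1 and 3 reproduces the ordering described under Problem.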

This is intentionally a warning, not an error or an auto-fix:

An explicitly supplied size-1 data_parallel_group is treated as intentional and never warns.
A genuine single-process run (WORLD_SIZE unset or 1) never warns.
It does not change any partitioning behavior, so it cannot break a working setup.

Detection is factored into a small pure helper _unsharded_single_rank_warning(...) so it is unit-testable without a GPU or a live process group.
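A minimal sketch of what such a pure helper could look like, written from the rules listed above rather than copied from the diff. The name unsharded_single_rank_warning, its parameters, and the message text are assumptions for illustration; the actual signature in the PR is not shown here.

```python
import os
from typing import Mapping, Optional


def unsharded_single_rank_warning(
    dp_world_size: int,
    explicit_group: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return a warning message if sharding silently collapsed, else None.

    Hypothetical stand-in for the PR's helper; no GPU or process group needed.
    """
    env = os.environ if env is None else env

    # An explicitly supplied group is treated as intentional, and a group
    # that spans more than one rank is actually sharding.
    if explicit_group or dp_world_size > 1:
        return None

    # Malformed WORLD_SIZE values must not raise; treat them as "unknown".
    try:
        launcher_world_size = int(env.get("WORLD_SIZE", "1"))
    except (TypeError, ValueError):
        return None

    # A genuine single-process run (WORLD_SIZE unset or 1) never warns.
    if launcher_world_size <= 1:
        return None

    return (
        f"zero.Init resolved a single-rank group but WORLD_SIZE={launcher_world_size}; "
        "parameters will not be partitioned. Call deepspeed.init_distributed() "
        "before building the model under zero.Init."
    )


if __name__ == "__main__":
    # Launcher says 4 ranks, group collapsed to 1: a message is returned.
    print(unsharded_single_rank_warning(1, False, {"WORLD_SIZE": "4"}))
    # Genuine single-process run: prints None.
    print(unsharded_single_rank_warning(1, False, {}))
```

Passing the environment as an argument keeps the check free of global state, which is what makes it testable without a live process group.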

Tests

tests/unit/runtime/zero/test_zero_init_unsharded_warning.py covers:

warns when launcher WORLD_SIZE>1 but group is single-rank,
no warning for a genuine single-process run (WORLD_SIZE unset / 1),
no warning when the group actually shards (dp_world_size>1),
no warning when an explicit data_parallel_group is supplied,
malformed WORLD_SIZE values do not raise.

Made with Cursor

…#8084) When a multi-process launcher sets WORLD_SIZE>1 but the distributed process group is not initialized before zero.Init runs (e.g. from_pretrained before deepspeed.init_distributed()), the resolved group collapses to a single rank. zero.Init then materializes every parameter whole on every rank instead of partitioning, so each rank loads the full model and OOMs with no diagnostic. Detect this case and emit an actionable warning pointing at the missing init_distributed() call. Co-authored-by: Cursor <cursoragent@cursor.com>

akshansh47 requested review from loadams, tjruwase and tohtana as code owners June 24, 2026 18:59

akshansh47 wants to merge 1 commit into deepspeedai:master from akshansh47:fix/zero-init-unsharded-single-rank-warning
