Enable bf16 check_grad_overflow by default (matching fp16) by yongzhe-wang · Pull Request #8035 · deepspeedai/DeepSpeed

yongzhe-wang · 2026-05-29T03:38:21Z

Summary

Flip DeepSpeedBF16Config.check_grad_overflow default from False to True, so bf16 users get the same gradient-overflow protection that fp16 users already get by default.

Motivation

The bf16 documentation states bf16 "does not require loss scaling" (deepspeed.ai/docs/config-json/), but this overstates the safety guarantee for the bf16 + ZeRO-2 (non-offload) partition-flat gradient accumulation path. We reproduced a deterministic catastrophic NaN under a small set of training conditions:

- ZeRO-2 (non-offload) + bf16
- Mixture-of-Transformers (modality-specific transformer branches)
- Heterogeneous per-sample loss masks (e.g. 50% action-invalid samples in robotics VLA training)

Under these conditions, a single bf16 element in engine.optimizer.averaged_gradients[i] overflows to +inf. The downstream Adam.step then computes inf / sqrt(inf) = NaN in a fused kernel, which simultaneously corrupts the partition slice's exp_avg, exp_avg_sq, and fp32 master weights. The next forward pass propagates NaN through every layer; the training run is dead with no useful diagnostic. Reproduced consistently in DeepSpeed 0.16.9 - 0.17.1 at step ~22 with our internal repro.
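To make the failure mode concrete, here is a minimal scalar sketch of an Adam-style update in plain Python floats. It is not DeepSpeed's fused kernel and it omits bias correction; `adam_step` and its default hyperparameters are hypothetical names chosen for illustration. It only shows that one `+inf` gradient element is enough to make both moment estimates non-finite and turn the weight into NaN.

```python
import math

def adam_step(param, grad, exp_avg, exp_avg_sq,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam-style update (hypothetical helper, no bias correction)."""
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * grad * grad
    update = exp_avg / (math.sqrt(exp_avg_sq) + eps)
    return param - lr * update, exp_avg, exp_avg_sq

# A finite gradient keeps the weight and both moments finite.
ok = adam_step(param=0.5, grad=0.1, exp_avg=0.0, exp_avg_sq=0.0)

# One overflowed gradient: both moments become inf, and inf / inf is NaN.
bad = adam_step(param=0.5, grad=math.inf, exp_avg=0.0, exp_avg_sq=0.0)

# A later finite gradient cannot repair the state.
after = adam_step(bad[0], 0.1, bad[1], bad[2])

print("finite step:   ", ok)
print("overflow step: ", bad)
print("following step:", after)
```

In this sketch the damage is permanent because the non-finite moments are carried into every later update, which is why skipping the step before the optimizer runs is the only safe place to intervene.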

The infrastructure to detect and skip such steps was correctly added by #6976 (check_grad_overflow option, DeepSpeedZeroOptimizer.check_overflow method, and step-skip logic at stage_1_and_2.step() lines ~2128-2143). However, the default was set to False for bf16, so users who hit this condition do not receive the protection.
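The detect-and-skip behaviour can be sketched in a few lines. This is an illustration only, not the code in stage_1_and_2.step(): `guarded_step` is a hypothetical name, plain SGD is swapped in for Adam to keep it short, and gradients are Python lists rather than partitioned tensors.

```python
import math

def has_overflow(grads):
    """Return True if any gradient element is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)

def guarded_step(params, grads, lr=0.1, check_grad_overflow=True):
    """Apply a plain SGD update unless the overflow check trips.

    Hypothetical helper. Returns (new_params, skipped).
    """
    if check_grad_overflow and has_overflow(grads):
        return list(params), True  # skip: weights stay exactly as they were
    return [p - lr * g for p, g in zip(params, grads)], False

params = [1.0, 2.0]
grads = [0.5, math.inf]  # one overflowed element

protected, skipped_on = guarded_step(params, grads, check_grad_overflow=True)
unprotected, skipped_off = guarded_step(params, grads, check_grad_overflow=False)

print(protected, skipped_on)      # weights unchanged, step skipped
print(unprotected, skipped_off)   # second weight is no longer finite
```

With the flag off, the non-finite gradient flows straight into the weights, which corresponds to the situation bf16 users are in under the old default.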

Change

Single line: check_grad_overflow: bool = False -> check_grad_overflow: bool = True in DeepSpeedBF16Config. Updated docstring + bf16 example block accordingly.

Backward compatibility

Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting:

```json
"bf16": {
  "enabled": true,
  "check_grad_overflow": false
}
```

The runtime cost is one isfinite-style scan over the gradient partition per optimizer step (already implemented in DeepSpeedZeroOptimizer.check_overflow); typically under 1% of step wallclock.
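For users who build their DeepSpeed config in Python rather than in a JSON file, the opt-out above might look like the sketch below. `make_ds_config` is a hypothetical helper, and the `train_micro_batch_size_per_gpu` and `zero_optimization` entries are placeholder values that are not part of this PR; only the `bf16` keys come from the description.

```python
import json

def make_ds_config(check_grad_overflow=None):
    """Build a minimal config dict (hypothetical helper).

    Passing None omits the key, so the library default applies.
    """
    bf16 = {"enabled": True}
    if check_grad_overflow is not None:
        bf16["check_grad_overflow"] = check_grad_overflow
    return {
        "train_micro_batch_size_per_gpu": 1,   # placeholder value
        "zero_optimization": {"stage": 2},     # placeholder value
        "bf16": bf16,
    }

default_cfg = make_ds_config()                         # relies on the default
opt_out_cfg = make_ds_config(check_grad_overflow=False)  # explicit opt-out

print(json.dumps(opt_out_cfg, indent=2))
```

Leaving the key out is what changes behaviour under this PR; configs that already set it explicitly should behave as before.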

Test plan

Reproducer (private repo): with default False, run dies at step ~22; with True, training survives via DeepSpeed's existing skip-step path.
Existing CI should pass unchanged; this PR only changes a default value in precision_config.py.

tohtana · 2026-06-02T21:19:15Z

Thank you @yongzhe-wang,
Overall this change looks good to me, but I'm still not sure about the performance impact. Adding a synchronization point might make a noticeable difference.

What are your thoughts, @sfc-gh-truwase?

yongzhe-wang · 2026-06-09T02:27:25Z

Thanks @tohtana — good question. A few things that I think bound the sync-cost concern:

This isn't a new synchronization point for DeepSpeed — fp16 already does exactly this, unconditionally, every step. In engine.py the flag is hard-wired on for fp16:

if self.bfloat16_enabled():
    check_grad_overflow = self._config.bfloat16_config.check_grad_overflow
elif self.fp16_enabled():
    check_grad_overflow = True      # fp16 always pays this
else:
    check_grad_overflow = False

The work it gates — has_overflow(): an isfinite scan over the gradient partition, a scalar all_reduce(MAX), and one .item() — is the same code path fp16 has run by default for years. So this is a well-characterized production cost, not a new one; the PR just gives bf16 the same protection.

ZeRO-1/2 already incurs a per-step device sync regardless. In stage_1_and_2.step(), every non-overflow step calls scaled_global_norm() for gradient clipping, which does an all_reduce + .item(). The engine therefore isn't running fully async across the optimizer step anyway — the overflow check's .item() lands just before a sync that was already going to happen.
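The shape of the check described in this comment (a local isfinite scan, a scalar all_reduce(MAX), then one host-side read) can be simulated without GPUs. The sketch below is a single-process stand-in, not DeepSpeed's has_overflow(): `local_overflow_flag`, `all_reduce_max`, and `global_overflow` are hypothetical names, and the "ranks" are just Python lists.

```python
import math

def local_overflow_flag(grad_partition):
    """1.0 if this rank's partition holds any inf/NaN, else 0.0."""
    return 0.0 if all(math.isfinite(g) for g in grad_partition) else 1.0

def all_reduce_max(flags):
    """Stand-in for a scalar all_reduce(MAX): every rank gets the maximum."""
    reduced = max(flags)
    return [reduced] * len(flags)

def global_overflow(partitions):
    """Per-rank decision after the reduction (hypothetical helper)."""
    flags = [local_overflow_flag(p) for p in partitions]
    return [flag > 0.0 for flag in all_reduce_max(flags)]

# Three simulated ranks; only rank 1 holds an overflowed element.
dirty = [[0.1, -0.2], [math.inf, 0.3], [0.0, 0.5]]
clean = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.5]]

print(global_overflow(dirty))  # every rank agrees to skip
print(global_overflow(clean))  # every rank agrees to step
```

The MAX reduction matters because each rank sees only its own partition: without it, the rank holding the overflow would skip while the others stepped, leaving the partitions inconsistent.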

sfc-gh-truwase · 2026-06-23T23:21:57Z

-    Check for gradient overflows and underflows
+    check_grad_overflow: bool = True
+    """
+    Detect gradient overflow/underflow before optimizer step and skip the step


Make this terse, and move the details to the PR. Okay to leave the issue and PR links here.

sfc-gh-truwase · 2026-06-23T23:23:58Z

@yongzhe-wang, apologies for the delayed response. Thanks for this PR.

Similar to @tohtana's point, a performance concern was what prevented making this the default. But I think your arguments are solid: (1) parity with the known production cost of fp16, and (2) potentially reusing the current grad norm scans to reduce cost.

I left one minor code cleanup comment.

tohtana · 2026-06-25T04:09:58Z

@yongzhe-wang Let me know when you think this PR is ready to merge.

yongzhe-wang requested review from tjruwase and tohtana as code owners May 29, 2026 03:38

sfc-gh-truwase reviewed Jun 23, 2026

sfc-gh-truwase closed this Jun 23, 2026

sfc-gh-truwase reopened this Jun 23, 2026

sfc-gh-truwase approved these changes Jun 23, 2026

Enable bf16 check_grad_overflow by default (matching fp16)

90c4c8b

Signed-off-by: Yongzhe Wang <yzwang2020@gmail.com>

yongzhe-wang force-pushed the fix/bf16-check-grad-overflow-default-true branch from 46dcf7b to 90c4c8b Compare June 25, 2026 04:05

yongzhe-wang wants to merge 1 commit into deepspeedai:master from yongzhe-wang:fix/bf16-check-grad-overflow-default-true
