Enable bf16 check_grad_overflow by default (matching fp16) #8035
yongzhe-wang wants to merge 1 commit
Thank you, @yongzhe-wang. What are your thoughts, @sfc-gh-truwase?
Thanks @tohtana — good question. A few things that I think bound the sync-cost concern:
    if self.bfloat16_enabled():
        check_grad_overflow = self._config.bfloat16_config.check_grad_overflow
    elif self.fp16_enabled():
        check_grad_overflow = True  # fp16 always pays this
    else:
        check_grad_overflow = False

The work it gates, has_overflow() (an isfinite scan over the gradient partition, a scalar all_reduce(MAX), and one .item()), is the same code path fp16 has run by default for years. So this is a well-characterized production cost, not a new one; the PR just gives bf16 the same protection.
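To make the three pieces of that cost concrete, here is a minimal pure-Python sketch. It is not DeepSpeed's implementation: `local_overflow_flag`, `simulated_all_reduce_max`, and `has_overflow_sketch` are hypothetical names, ranks are simulated as plain lists, and the built-in `max` stands in for the collective.

```python
import math

def local_overflow_flag(grad_partition):
    """Return 1.0 if any gradient in this rank's partition is inf or NaN, else 0.0."""
    return 0.0 if all(math.isfinite(g) for g in grad_partition) else 1.0

def simulated_all_reduce_max(flags):
    """Stand-in for all_reduce(MAX): every rank ends up seeing the largest flag."""
    return max(flags)

def has_overflow_sketch(partitions):
    """Hypothetical helper: one local scan per rank, one scalar reduction, one bool."""
    flags = [local_overflow_flag(p) for p in partitions]
    return simulated_all_reduce_max(flags) > 0.0

# Two simulated ranks; the second one holds a single overflowed element.
overflow = has_overflow_sketch([[0.1, -0.2], [0.3, float("inf")]])
clean = has_overflow_sketch([[0.1, -0.2], [0.3, 0.4]])
```

The point of the sketch is that the per-step work is one linear scan per rank plus a single scalar exchange, regardless of which precision mode enabled it.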
    Check for gradient overflows and underflows
    check_grad_overflow: bool = True
    """
    Detect gradient overflow/underflow before optimizer step and skip the step
Make this terse, and move the details to the PR. Okay to leave the issue and PR links here.
@yongzhe-wang, apologies for the delayed response, and thanks for this PR. As @tohtana pointed out, the performance concern is what kept this from being the default. But I think your arguments are solid: (1) parity with the known production cost of fp16, and (2) potentially reusing the current grad norm scans to reduce cost. I left one minor code cleanup comment.
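One way the second argument might look, sketched in pure Python under the assumption that a gradient norm is already computed each step (for example for clipping). The helpers `grad_norm` and `overflow_from_norm` are hypothetical and do not correspond to DeepSpeed functions.

```python
import math

def grad_norm(grads):
    """L2 norm of a flat gradient list."""
    # Multiply instead of using ** so that very large values become inf
    # rather than raising OverflowError in pure Python.
    return math.sqrt(sum(g * g for g in grads))

def overflow_from_norm(norm):
    """An inf or NaN element makes the norm non-finite, so no second scan is needed."""
    return not math.isfinite(norm)

norm_ok = grad_norm([3.0, 4.0])
norm_inf = grad_norm([3.0, float("inf")])
norm_nan = grad_norm([3.0, float("nan")])
```

A caveat on this design: the sum of squares can itself overflow when every element is finite but very large, so a norm-based check is conservative and may flag a step that an element-wise scan would accept.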
Signed-off-by: Yongzhe Wang <yzwang2020@gmail.com>
46dcf7b to 90c4c8b
@yongzhe-wang Let me know when you think this PR is ready to merge.
Summary
Flip `DeepSpeedBF16Config.check_grad_overflow` default from `False` to `True`, so bf16 users get the same gradient-overflow protection that fp16 users already get by default.
Motivation
The bf16 documentation states bf16 "does not require loss scaling" (deepspeed.ai/docs/config-json/), but this overstates the safety guarantee for the bf16 + ZeRO-2 (non-offload) partition-flat gradient accumulation path. We reproduced a deterministic catastrophic NaN under a small set of training conditions:
Under these conditions, a single bf16 element in `engine.optimizer.averaged_gradients[i]` overflows to `+inf`. The downstream `Adam.step` then computes `inf / sqrt(inf) = NaN` in a fused kernel, which simultaneously corrupts the partition slice's `exp_avg`, `exp_avg_sq`, and fp32 master weights. The next forward pass propagates NaN through every layer; the training run is dead with no useful diagnostic. Reproduced consistently in DeepSpeed 0.16.9 - 0.17.1 at step ~22 with our internal repro.
The infrastructure to detect and skip such steps was correctly added by #6976 (`check_grad_overflow` option, `DeepSpeedZeroOptimizer.check_overflow` method, and step-skip logic at `stage_1_and_2.step()` lines ~2128-2143). However, the default was set to `False` for bf16, so users hitting this condition do not receive the protection.
Change
Single line: `check_grad_overflow: bool = False` -> `check_grad_overflow: bool = True` in `DeepSpeedBF16Config`. Updated docstring + bf16 example block accordingly.
Backward compatibility
Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting:
```json
"bf16": {
"enabled": true,
"check_grad_overflow": false
}
```
The runtime cost is one isfinite-style scan over the gradient partition per optimizer step (already implemented in `DeepSpeedZeroOptimizer.check_overflow`); typically under 1% of step wallclock.
Related
`check_grad_overflow` option and underlying skip logic
Test plan
`False`, run dies at step ~22; with `True`, training survives via DeepSpeed's existing skip-step path.
`precision_config.py`.
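As a toy illustration of the failure mode and the skip-step behavior described in this PR, the following pure-Python sketch runs a scalar Adam update with and without an overflow check. It is not the internal repro and not DeepSpeed code: `adam_step` and `train` are hypothetical helpers, bias correction is omitted for brevity, and a scalar parameter stands in for a partition slice.

```python
import math

def adam_step(param, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update without bias correction (simplified for illustration)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    param = param - lr * m / (math.sqrt(v) + eps)
    return param, m, v

def train(grads, check_grad_overflow):
    """Apply one update per gradient; optionally skip steps with non-finite gradients."""
    param, m, v = 1.0, 0.0, 0.0
    skipped = 0
    for g in grads:
        if check_grad_overflow and not math.isfinite(g):
            skipped += 1  # skip the step; param and optimizer state stay untouched
            continue
        param, m, v = adam_step(param, g, m, v)
    return param, m, v, skipped

# The second gradient has overflowed to +inf.
grads = [0.5, float("inf"), 0.25]
unchecked = train(grads, check_grad_overflow=False)
checked = train(grads, check_grad_overflow=True)
```

Without the check, the `inf` gradient drives both moment estimates to `inf`, the update becomes `inf / inf`, and the parameter stays NaN on every later step. With the check, the bad step is skipped and the remaining steps proceed on clean state.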