Add AutoEP + AutoTP parallel folding#8064
Conversation
Allow tensor parallelism (AutoTP) for the dense/attention path to coexist with expert parallelism (AutoEP) for routed experts on the same rank set, without requiring EP to be a subset of DP. - Treat dense and MoE as independent partitionings: dense view tp*dp, expert view ep*etp*edp, with dp/edp derived so tp*dp == ep*etp*edp == stage_size. expert_tensor_parallel_size is reserved (must currently be 1). - Express folding via the existing tensor_parallel/expert_parallel config sections, with divisibility, TP/sequence-parallel exclusivity, and preset_model consistency validation. - Add the route-full / partition-dispatch MoE path and AutoTP skipping of AutoEP subtrees; derive folded process groups via the generalized expert/data-parallel group creation. - Reduce TP-replicated router/gate gradients mode-aware (sum when tokens are partitioned, average when replicated); record per-parameter-family ZeRO checkpoint metadata and handle folded ZeRO-1/2 optimizer state. - Add folding unit tests (config, groups, dispatch, runtime, gradient parity, checkpoint), including multi-rank GPU-gated cases. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 278c919489
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| chunks = torch.split(grad_output, ctx.counts, dim=0) | ||
| grad_padded = grad_output.new_zeros((ctx.max_rows, *grad_output.shape[1:])) | ||
| if local_count: | ||
| grad_padded[:local_count].copy_(chunks[ctx.group_rank]) | ||
| return grad_padded[:local_count].contiguous(), None, None, None |
There was a problem hiding this comment.
Sum gathered-row gradients across TP lanes
When folded MoE output is consumed differently on each TP lane (for example by a row-parallel/lm-head layer that slices the hidden dimension), every gathered row participates in the loss on every lane. This backward path only returns chunks[ctx.group_rank] from the local rank's grad_output, so contributions from peer lanes to this rank's local expert outputs and routing weights are dropped; the padded local gradient needs to be accumulated across ctx.group before returning.
Useful? React with 👍 / 👎.
| grad_reduc = self.get_gradient_for_reduction(param) | ||
| self._maybe_reduce_autoep_folding_tp_gradient(param, grad_reduc) |
There was a problem hiding this comment.
Honor ds_grad_is_ready before TP reduction
In ZeRO-2 folded runs, parameters with ds_grad_is_ready=False are intentionally skipped until their transient/tiled gradient is complete, as the guard immediately below documents. Calling the new TP reduction before that guard mutates and all-reduces incomplete gradients for those parameters, which can corrupt the final accumulated gradient once the ready shard is eventually reduced.
Useful? React with 👍 / 👎.
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
| "is planned as follow-up work.") | ||
|
|
||
| expert_width = spec.ep_size * spec.etp_size | ||
| if spec.tp_size > 1 and expert_width > spec.dp_size: |
There was a problem hiding this comment.
What will happen if expert_width % spec.dp_size != 0?
There was a problem hiding this comment.
Good question! Your comment made me realize this was not only a remainder case, but also that the original version was still enforcing a stricter layout than Parallel Folding actually requires.
I generalized the layout after that: EP no longer has to be nested inside dense DP, so the previous checks around expert_width > dp_size and dp_size % expert_width != 0 are removed. In particular, expert_width does not need to divide dp_size; EP/EDP groups are derived over the per-stage rank set instead.
The remaining expert-side divisibility requirement is that the expert width tiles the stage:
stage_size % (ep * etp) == 0
If that does not hold, we still reject it fail-fast because edp cannot be derived cleanly. Otherwise dp and edp are derived under:
tp * dp == ep * etp * edp == stage_size
Examples that were previously rejected and now run include world=4 TP4/EP4 (dp=1, edp=1), world=4 TP2/EP4 (dp=2, edp=1), and world=8 TP4/EP4 (dp=2, edp=2).
| if not param.requires_grad or param.grad is None: | ||
| continue | ||
| if is_moe_param(param) or is_model_parallel_parameter(param): | ||
| continue |
There was a problem hiding this comment.
This filter cannot distingush router grads from laynorm grads. The router grads needs SUM because of dispatch of token to experts, but laynorm does not need SUM, so they need different reduce strategy.
There was a problem hiding this comment.
Here is the test, I verified on a CPU system that this test case cannot pass.
# Copyright (c) DeepSpeed Team.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
"""Engine-path (zero_stage=0) parity: does folded layernorm/gate match the non-folded
DP baseline in the FULL flow? (The author only tested ZeRO-2 parity; the engine path
that runs at zero_stage=0/1 is untested.) CPU/Gloo, world=8.
"""
import deepspeed
from unit.v1.moe.autoep_test_utils import make_autoep_config, run_cpu_gloo_test, seed_everything
from unit.v1.moe.test_autoep_autotp_grad_parity import (
_router_grad_model,
_run_router_grad_boundary,
_full_grad_by_suffix,
)
GATE_BASELINE = "model.layers.0.mlp.gate.weight"
GATE_FOLDED = "model.layers.0.mlp.router.gate.weight"
LN = "model.layers.0.input_layernorm.weight"
def _baseline_cfg():
c = {k: v for k, v in make_autoep_config(zero_stage=0, ep_size=1, mixed_precision=False).items()
if k != "expert_parallel"}
c["gradient_accumulation_steps"] = 2
c["gradient_clipping"] = 0.0
c["communication_data_type"] = "fp32"
c["optimizer"]["params"]["torch_adam"] = True
return c
def _folded_cfg():
c = make_autoep_config(zero_stage=0, ep_size=4, mixed_precision=False)
c["gradient_accumulation_steps"] = 2
c["gradient_clipping"] = 0.0
c["communication_data_type"] = "fp32"
c["optimizer"]["params"]["torch_adam"] = True
c["expert_parallel"]["autoep_size"] = 4
c["tensor_parallel"] = {"autotp_size": 2, "partition_config": {
"use_default_specs": False, "layer_specs": [{"patterns": [r".*\.weight$"], "partition_type": "skip"}]}}
return c
def _worker(rank, world_size, tmpdir):
seed = 1234
tp_size = 2
logical_dp_world_size = world_size // tp_size
logical_dp_rank = rank // tp_size
seed_everything(seed)
reference_state = _router_grad_model().state_dict()
baseline_model = _router_grad_model()
baseline_model.load_state_dict(reference_state)
baseline_engine, *_ = deepspeed.initialize(model=baseline_model, config=_baseline_cfg())
_run_router_grad_boundary(baseline_engine,
logical_dp_world_size=logical_dp_world_size,
logical_dp_rank=logical_dp_rank,
seed=seed)
base_gate = _full_grad_by_suffix(baseline_engine, GATE_BASELINE)
base_ln = _full_grad_by_suffix(baseline_engine, LN)
folded_model = _router_grad_model()
folded_model.load_state_dict(reference_state)
folded_engine, *_ = deepspeed.initialize(model=folded_model, config=_folded_cfg())
_run_router_grad_boundary(folded_engine,
logical_dp_world_size=logical_dp_world_size,
logical_dp_rank=logical_dp_rank,
seed=seed)
folded_gate = _full_grad_by_suffix(folded_engine, GATE_FOLDED)
folded_ln = _full_grad_by_suffix(folded_engine, LN)
gate_ratio = (folded_gate.norm() / base_gate.norm()).item()
ln_ratio = (folded_ln.norm() / base_ln.norm()).item()
print(f"[rank {rank}] ENGINE(zero0) gate_ratio={gate_ratio:.4f} ln_ratio={ln_ratio:.4f}")
if rank == 0:
assert abs(gate_ratio - 1.0) <= 5e-3, f"gate parity: {gate_ratio}"
assert abs(ln_ratio - 1.0) <= 5e-3, f"ln parity: {ln_ratio}"
def test_b1_engine_path_parity(tmpdir):
run_cpu_gloo_test(_worker, tmpdir, world_size=8)
There was a problem hiding this comment.
Thanks for the precise diagnosis! You are right that router/gate and LayerNorm gradients cannot be safely distinguished by that old filter. I changed the reduction logic so it is keyed to each parameter family’s actual replication structure instead.
For the current folded AutoEP+AutoTP path, both router/gate and LayerNorm end up using AVERAGE over the TP token-replication group, but for different reasons. LayerNorm is simply TP-replicated. Router/gate is different: we drop the duplicated TP activations before MoE dispatch, then reconstruct the full TP-replicated token view after MoE with restore_combined; the backward of that reconstruction introduces a tp_size factor, and the TP AVERAGE reduction cancels it. The SUM path is kept for the token-partial / true-SP cases, but it is not the live router/gate path here.
While fixing this, I also found a separate scaling issue for routed-expert weights. Their gradients go through the same restore_combined backward factor, but they had been classified as SKIP, so the tp_size factor reached the optimizer. That made folded expert gradients tp_size too large. The fix now cancels that factor by dividing routed-expert grads by tp_size without adding a TP all-reduce; it composes with the existing EDP all-reduce.
Your zero_stage=0 reproducer now passes on the head of this PR branch. I also added parity coverage against the non-folded baseline for router/gate, LayerNorm, and routed-expert weights across TP2/EP4 and TP4/EP4 folded shapes, including ZeRO 0 and ZeRO 2.
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Comment-only; no behavior change. Make the folded TP/SP gradient SUM-vs-AVERAGE policy and its reasoning explicit in the code: - mark_autoep_folding_* contracts: router_gate_replicated is the only marker applied on the live forward path (AVERAGE); routed-token-partial and SP-sharded-LayerNorm are future-only SUM contracts pinned by unit tests. - _AllGatherVariableRows / restore_combined: document the tp_size factor the restore all-gather backward injects, which makes the folded router/gate arrive replicated (AVERAGE) and why SUM reproduces the 2.0x parity regression the CPU/Gloo tests guard. - partition_assignments: the drop-before-all-to-all keeps EP dispatch at 1x volume, and the reconstruction is what makes the gradient replicated. - Strategy classifier: the router-vs-LayerNorm asymmetry (router can be a tp/sp partial because it rides the dispatch all-to-all; LayerNorm only under true SP) and the underlying gathered=AVERAGE / sharded=SUM rule. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Lift the temporary fail-fast that forced expert parallelism to be a subset of data parallelism (expert_width = ep*etp <= dp and dp % expert_width == 0) and support cross-lane EP, where an EP group may span TP lanes and dense-DP ranks (e.g. world=4 TP4/EP4/dp1, or world=4 TP2/DP2/EP4). The folded group tables already lay EP groups across the tp-lane-major rank ordering, so only the validation gate and the routed-expert gradient reduction needed to change. Fix routed-expert gradient over-scaling under folding. The folded forward all-gathers expert outputs into a replicated full view in restore_combined, whose backward injects a tp_size factor (the same factor the replicated router cancels via AVERAGE). Routed experts were classified SKIP, so that factor survived and the expert weight gradient reached the optimizer tp_size times too large. This was invisible to scale-invariant Adam but real for non-adaptive optimizers and for gradient clipping (it inflates the expert contribution to the global grad norm), and it grows with tp_size in cross-lane shapes. Routed experts now use a dedicated EXPERT_TP_CANCEL reduction that divides the gradient by tp_size with no TP all_reduce; the division is linear and composes with the existing expert-data-parallel all_reduce in any order. The per-family gradient convention is otherwise unchanged and is now justified for the general cross-lane layout: each family's reduction is keyed to its replication structure, not the EP layout. Router/gate and dense/LayerNorm AVERAGE over the TP (token-replication) group; routed experts cancel the restore tp_size factor and reduce data-parallel over the EDP group. Add CPU/Gloo parity coverage for cross-lane shapes (router/gate and LayerNorm vs a non-folded baseline at world=4 TP2/EP4 and TP4/EP4/dp1) and SGD-based expert-weight parity (post-step weight equality vs the non-folded baseline) for the MVP shape and cross-lane shapes including edp>1. Adam is scale-invariant, so the expert test uses SGD to actually exercise the gradient magnitude. Update the folding config/group-table tests: cross-lane shapes are now accepted, and the cross-lane EP/EDP rank tables are asserted. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
|
@tohtana Thanks for the extensive amount of work, super appreciated. A couple of questions (I will also run some tests on my end):
|
This PR adds parallel folding for AutoEP: tensor parallelism (AutoTP) for the dense/attention path can now coexist with expert parallelism (AutoEP) for the routed-expert path on the same set of ranks, without forcing EP to be a subset of DP — an EP group may span TP lanes and dense-DP ranks (cross-lane EP).
(This PR should be adjusted for ZeRO3 support after #8060 is merged)
Design
Attention/dense and MoE are treated as two independent partitionings of the same rank set, parameterized per parameter family:
stage_size = tp * dpstage_size = ep * etp * edpdpandedpare always derived, never user-configured, so the invarianttp * dp == ep * etp * edp == stage_sizecannot be broken from config. The only structural requirement is that the expert width tiles the stage (stage_size % (ep * etp) == 0); EP groups are then laid across a TP-lane-major rank ordering, so they may span TP lanes and dense-DP ranks.Configuration
No new config section. Folding is expressed by the coexistence of the existing
tensor_parallelandexpert_parallelsections:{ "tensor_parallel": { "autotp_size": 4 }, "expert_parallel": { "enabled": true, "autoep_size": 4, "expert_tensor_parallel_size": 1 } }expert_tensor_parallel_sizeis carried as a config field but currently must be1(expert-internal TP is reserved as follow-up and rejected fail-fast). Validation enforces stage divisibility, TP/sequence-parallel exclusivity, andpreset_modelconsistency between the two sections.Cross-lane expert parallelism
Expert parallelism no longer has to be a subset of data parallelism. Shapes where the expert width exceeds (or does not divide) the dense data-parallel size are supported, for example:
world=4, TP=4, EP=4(dp=1): the EP group is the whole TP group — one expert per rank.world=4, TP=2, EP=4(dp=2): the EP group spans both TP lanes and both DP ranks.world=8, TP=4, EP=4(dp=2, edp=2): EP groups span TP lanes with expert replication.The per-family gradient convention is keyed to each parameter's replication structure, not to the EP layout, so it holds across the whole
tp*dppool:restore_combined, whose all-gather backward injects atp_sizefactor that AVERAGE divides out.restore_combinedtp_sizefactor (divide bytp_size, no TP all-reduce) and reduce data-parallel over the expert-data-parallel (EDP) group. Without the cancellation, folded expert gradients are over-scaled bytp_size— invisible to scale-invariant Adam, but real for non-adaptive optimizers and for gradient clipping (it inflates the expert contribution to the global grad norm). This is now fixed for all folded shapes (the MVP TP2×EP4 shape included).What's included
mp_modeTP-strided vs SP-consecutive ordering), including cross-lane EP group tables.deepspeed/moe/ep_tp_dispatch.py), with AutoTP skipping AutoEP subtrees.tp_sizecancellation for routed experts; SKIP for genuinely TP-sharded params; SUM contracts reserved for a future true sequence-parallel path.Correctness & validation
ep>dp) and TP4×EP4 (4-rank,dp=1); scale 1.0.edp=1andedp=2.edp=1andedp=2); cross-lane folded training runs with finite loss and finite, non-zero expert/router gradients.aws-torch-latest-full) on H100 GPUs.Scope / follow-ups
etp=1. The replicated-grad reduction is mode-aware so the sequence-parallel (Ulysses) folding case fits the same contract; AutoTP + AutoEP is the validated path here.expert_tensor_parallel_size > 1) is reserved for a follow-up.