Add AutoEP + AutoTP parallel folding by tohtana · Pull Request #8064 · deepspeedai/DeepSpeed

tohtana · 2026-06-13T17:43:43Z

This PR adds parallel folding for AutoEP: tensor parallelism (AutoTP) for the dense/attention path can now coexist with expert parallelism (AutoEP) for the routed-expert path on the same set of ranks, without forcing EP to be a subset of DP — an EP group may span TP lanes and dense-DP ranks (cross-lane EP).
(This PR should be adjusted for ZeRO3 support after #8060 is merged)

Design

Attention/dense and MoE are treated as two independent partitionings of the same rank set, parameterized per parameter family:

Dense / attention / shared-expert params: stage_size = tp * dp
Routed-expert params: stage_size = ep * etp * edp

dp and edp are always derived, never user-configured, so the invariant tp * dp == ep * etp * edp == stage_size cannot be broken from config. The only structural requirement is that the expert width tiles the stage (stage_size % (ep * etp) == 0); EP groups are then laid across a TP-lane-major rank ordering, so they may span TP lanes and dense-DP ranks.

Configuration

No new config section. Folding is expressed by the coexistence of the existing tensor_parallel and expert_parallel sections:

{
  "tensor_parallel":  { "autotp_size": 4 },
  "expert_parallel":  { "enabled": true, "autoep_size": 4,
                        "expert_tensor_parallel_size": 1 }
}

expert_tensor_parallel_size is carried as a config field but currently must be 1 (expert-internal TP is reserved as follow-up and rejected fail-fast). Validation enforces stage divisibility, TP/sequence-parallel exclusivity, and preset_model consistency between the two sections.

Cross-lane expert parallelism

Expert parallelism no longer has to be a subset of data parallelism. Shapes where the expert width exceeds (or does not divide) the dense data-parallel size are supported, for example:

world=4, TP=4, EP=4 (dp=1): the EP group is the whole TP group — one expert per rank.
world=4, TP=2, EP=4 (dp=2): the EP group spans both TP lanes and both DP ranks.
world=8, TP=4, EP=4 (dp=2, edp=2): EP groups span TP lanes with expert replication.

The per-family gradient convention is keyed to each parameter's replication structure, not to the EP layout, so it holds across the whole tp*dp pool:

Router/gate and dense/LayerNorm are AVERAGE over the TP (token-replication) group. The folded router runs redundantly on every TP peer; its partitioned work is reconstructed into a replicated full view by restore_combined, whose all-gather backward injects a tp_size factor that AVERAGE divides out.
Routed experts cancel that same restore_combined tp_size factor (divide by tp_size, no TP all-reduce) and reduce data-parallel over the expert-data-parallel (EDP) group. Without the cancellation, folded expert gradients are over-scaled by tp_size — invisible to scale-invariant Adam, but real for non-adaptive optimizers and for gradient clipping (it inflates the expert contribution to the global grad norm). This is now fixed for all folded shapes (the MVP TP2×EP4 shape included).

What's included

Folded process-group derivation using the generalized expert/data-parallel group creation (mp_mode TP-strided vs SP-consecutive ordering), including cross-lane EP group tables.
Route-full / partition-dispatch path for folded MoE (deepspeed/moe/ep_tp_dispatch.py), with AutoTP skipping AutoEP subtrees.
Per-family folded gradient reduction: AVERAGE for replicated router/gate and dense/LayerNorm; a dedicated tp_size cancellation for routed experts; SKIP for genuinely TP-sharded params; SUM contracts reserved for a future true sequence-parallel path.
Per-parameter-family ZeRO checkpoint metadata (routed-expert vs dense/router/shared placement) and folded ZeRO-1/2 optimizer-state handling.

Correctness & validation

Router/gate and LayerNorm gradient parity against a non-folded ZeRO baseline (atol=1e-1, rtol=5e-3, fp32), on TP2×EP4 (8-rank) and the cross-lane shapes TP2×EP4 (4-rank, ep>dp) and TP4×EP4 (4-rank, dp=1); scale 1.0.
Routed-expert weight parity against a non-folded baseline, verified with SGD (Adam is scale-invariant and would mask a uniform gradient-scale error), for the MVP TP2×EP4 shape and cross-lane TP4×EP4 with edp=1 and edp=2.
New folding unit tests for config, cross-lane group layout, dispatch, runtime, gradient parity, and checkpoint save/load (multi-rank GPU cases gated for GPU runners; CPU/Gloo parity runs on CI).
Real-H100 confirmation (8×H100): router/gate, LayerNorm, and routed-expert gradient parity to the non-folded baseline hold (scale 1.0) for MVP TP2×EP4 and cross-lane TP4×EP4 (edp=1 and edp=2); cross-lane folded training runs with finite loss and finite, non-zero expert/router gradients.
Passes the full unit test suite (aws-torch-latest-full) on H100 GPUs.

Scope / follow-ups

This PR covers AutoEP + AutoTP folding, including cross-lane EP with etp=1. The replicated-grad reduction is mode-aware so the sequence-parallel (Ulysses) folding case fits the same contract; AutoTP + AutoEP is the validated path here.
Expert-internal tensor parallelism (expert_tensor_parallel_size > 1) is reserved for a follow-up.
ZeRO-3 composition with folding is planned as separate follow-up work (after Support AutoEP with ZeRO-3 zero.Init source modules #8060 is merged).

Allow tensor parallelism (AutoTP) for the dense/attention path to coexist with expert parallelism (AutoEP) for routed experts on the same rank set, without requiring EP to be a subset of DP. - Treat dense and MoE as independent partitionings: dense view tp*dp, expert view ep*etp*edp, with dp/edp derived so tp*dp == ep*etp*edp == stage_size. expert_tensor_parallel_size is reserved (must currently be 1). - Express folding via the existing tensor_parallel/expert_parallel config sections, with divisibility, TP/sequence-parallel exclusivity, and preset_model consistency validation. - Add the route-full / partition-dispatch MoE path and AutoTP skipping of AutoEP subtrees; derive folded process groups via the generalized expert/data-parallel group creation. - Reduce TP-replicated router/gate gradients mode-aware (sum when tokens are partitioned, average when replicated); record per-parameter-family ZeRO checkpoint metadata and handle folded ZeRO-1/2 optimizer state. - Add folding unit tests (config, groups, dispatch, runtime, gradient parity, checkpoint), including multi-rank GPU-gated cases. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 278c919489

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-13T17:51:03Z

+        chunks = torch.split(grad_output, ctx.counts, dim=0)
+        grad_padded = grad_output.new_zeros((ctx.max_rows, *grad_output.shape[1:]))
+        if local_count:
+            grad_padded[:local_count].copy_(chunks[ctx.group_rank])
+        return grad_padded[:local_count].contiguous(), None, None, None


Sum gathered-row gradients across TP lanes

When folded MoE output is consumed differently on each TP lane (for example by a row-parallel/lm-head layer that slices the hidden dimension), every gathered row participates in the loss on every lane. This backward path only returns chunks[ctx.group_rank] from the local rank's grad_output, so contributions from peer lanes to this rank's local expert outputs and routing weights are dropped; the padded local gradient needs to be accumulated across ctx.group before returning.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-13T17:51:03Z

        grad_reduc = self.get_gradient_for_reduction(param)
+        self._maybe_reduce_autoep_folding_tp_gradient(param, grad_reduc)


Honor ds_grad_is_ready before TP reduction

In ZeRO-2 folded runs, parameters with ds_grad_is_ready=False are intentionally skipped until their transient/tiled gradient is complete, as the guard immediately below documents. Calling the new TP reduction before that guard mutates and all-reduces incomplete gradients for those parameters, which can corrupt the final accumulated gradient once the ready shard is eventually reduced.

Useful? React with 👍 / 👎.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

delock · 2026-06-18T09:47:14Z

+                         "is planned as follow-up work.")
+
+    expert_width = spec.ep_size * spec.etp_size
+    if spec.tp_size > 1 and expert_width > spec.dp_size:


What will happen if expert_width % spec.dp_size != 0?

Good question! Your comment made me realize this was not only a remainder case, but also that the original version was still enforcing a stricter layout than Parallel Folding actually requires.

I generalized the layout after that: EP no longer has to be nested inside dense DP, so the previous checks around expert_width > dp_size and dp_size % expert_width != 0 are removed. In particular, expert_width does not need to divide dp_size; EP/EDP groups are derived over the per-stage rank set instead.

The remaining expert-side divisibility requirement is that the expert width tiles the stage:

stage_size % (ep * etp) == 0

If that does not hold, we still reject it fail-fast because edp cannot be derived cleanly. Otherwise dp and edp are derived under:

tp * dp == ep * etp * edp == stage_size

Examples that were previously rejected and now run include world=4 TP4/EP4 (dp=1, edp=1), world=4 TP2/EP4 (dp=2, edp=1), and world=8 TP4/EP4 (dp=2, edp=2).

delock · 2026-06-19T05:09:43Z

+            if not param.requires_grad or param.grad is None:
+                continue
+            if is_moe_param(param) or is_model_parallel_parameter(param):
+                continue


This filter cannot distingush router grads from laynorm grads. The router grads needs SUM because of dispatch of token to experts, but laynorm does not need SUM, so they need different reduce strategy.

Here is the test, I verified on a CPU system that this test case cannot pass.

# Copyright (c) DeepSpeed Team. # SPDX-License-Identifier: Apache-2.0 # DeepSpeed Team """Engine-path (zero_stage=0) parity: does folded layernorm/gate match the non-folded DP baseline in the FULL flow? (The author only tested ZeRO-2 parity; the engine path that runs at zero_stage=0/1 is untested.) CPU/Gloo, world=8. """ import deepspeed from unit.v1.moe.autoep_test_utils import make_autoep_config, run_cpu_gloo_test, seed_everything from unit.v1.moe.test_autoep_autotp_grad_parity import ( _router_grad_model, _run_router_grad_boundary, _full_grad_by_suffix, ) GATE_BASELINE = "model.layers.0.mlp.gate.weight" GATE_FOLDED = "model.layers.0.mlp.router.gate.weight" LN = "model.layers.0.input_layernorm.weight" def _baseline_cfg(): c = {k: v for k, v in make_autoep_config(zero_stage=0, ep_size=1, mixed_precision=False).items() if k != "expert_parallel"} c["gradient_accumulation_steps"] = 2 c["gradient_clipping"] = 0.0 c["communication_data_type"] = "fp32" c["optimizer"]["params"]["torch_adam"] = True return c def _folded_cfg(): c = make_autoep_config(zero_stage=0, ep_size=4, mixed_precision=False) c["gradient_accumulation_steps"] = 2 c["gradient_clipping"] = 0.0 c["communication_data_type"] = "fp32" c["optimizer"]["params"]["torch_adam"] = True c["expert_parallel"]["autoep_size"] = 4 c["tensor_parallel"] = {"autotp_size": 2, "partition_config": { "use_default_specs": False, "layer_specs": [{"patterns": [r".*\.weight$"], "partition_type": "skip"}]}} return c def _worker(rank, world_size, tmpdir): seed = 1234 tp_size = 2 logical_dp_world_size = world_size // tp_size logical_dp_rank = rank // tp_size seed_everything(seed) reference_state = _router_grad_model().state_dict() baseline_model = _router_grad_model() baseline_model.load_state_dict(reference_state) baseline_engine, *_ = deepspeed.initialize(model=baseline_model, config=_baseline_cfg()) _run_router_grad_boundary(baseline_engine, logical_dp_world_size=logical_dp_world_size, logical_dp_rank=logical_dp_rank, seed=seed) base_gate = _full_grad_by_suffix(baseline_engine, GATE_BASELINE) base_ln = _full_grad_by_suffix(baseline_engine, LN) folded_model = _router_grad_model() folded_model.load_state_dict(reference_state) folded_engine, *_ = deepspeed.initialize(model=folded_model, config=_folded_cfg()) _run_router_grad_boundary(folded_engine, logical_dp_world_size=logical_dp_world_size, logical_dp_rank=logical_dp_rank, seed=seed) folded_gate = _full_grad_by_suffix(folded_engine, GATE_FOLDED) folded_ln = _full_grad_by_suffix(folded_engine, LN) gate_ratio = (folded_gate.norm() / base_gate.norm()).item() ln_ratio = (folded_ln.norm() / base_ln.norm()).item() print(f"[rank {rank}] ENGINE(zero0) gate_ratio={gate_ratio:.4f} ln_ratio={ln_ratio:.4f}") if rank == 0: assert abs(gate_ratio - 1.0) <= 5e-3, f"gate parity: {gate_ratio}" assert abs(ln_ratio - 1.0) <= 5e-3, f"ln parity: {ln_ratio}" def test_b1_engine_path_parity(tmpdir): run_cpu_gloo_test(_worker, tmpdir, world_size=8)

Thanks for the precise diagnosis! You are right that router/gate and LayerNorm gradients cannot be safely distinguished by that old filter. I changed the reduction logic so it is keyed to each parameter family’s actual replication structure instead.

For the current folded AutoEP+AutoTP path, both router/gate and LayerNorm end up using AVERAGE over the TP token-replication group, but for different reasons. LayerNorm is simply TP-replicated. Router/gate is different: we drop the duplicated TP activations before MoE dispatch, then reconstruct the full TP-replicated token view after MoE with restore_combined; the backward of that reconstruction introduces a tp_size factor, and the TP AVERAGE reduction cancels it. The SUM path is kept for the token-partial / true-SP cases, but it is not the live router/gate path here.

While fixing this, I also found a separate scaling issue for routed-expert weights. Their gradients go through the same restore_combined backward factor, but they had been classified as SKIP, so the tp_size factor reached the optimizer. That made folded expert gradients tp_size too large. The fix now cancels that factor by dividing routed-expert grads by tp_size without adding a TP all-reduce; it composes with the existing EDP all-reduce.

Your zero_stage=0 reproducer now passes on the head of this PR branch. I also added parity coverage against the non-folded baseline for router/gate, LayerNorm, and routed-expert weights across TP2/EP4 and TP4/EP4 folded shapes, including ZeRO 0 and ZeRO 2.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Comment-only; no behavior change. Make the folded TP/SP gradient SUM-vs-AVERAGE policy and its reasoning explicit in the code: - mark_autoep_folding_* contracts: router_gate_replicated is the only marker applied on the live forward path (AVERAGE); routed-token-partial and SP-sharded-LayerNorm are future-only SUM contracts pinned by unit tests. - _AllGatherVariableRows / restore_combined: document the tp_size factor the restore all-gather backward injects, which makes the folded router/gate arrive replicated (AVERAGE) and why SUM reproduces the 2.0x parity regression the CPU/Gloo tests guard. - partition_assignments: the drop-before-all-to-all keeps EP dispatch at 1x volume, and the reconstruction is what makes the gradient replicated. - Strategy classifier: the router-vs-LayerNorm asymmetry (router can be a tp/sp partial because it rides the dispatch all-to-all; LayerNorm only under true SP) and the underlying gathered=AVERAGE / sharded=SUM rule. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Lift the temporary fail-fast that forced expert parallelism to be a subset of data parallelism (expert_width = ep*etp <= dp and dp % expert_width == 0) and support cross-lane EP, where an EP group may span TP lanes and dense-DP ranks (e.g. world=4 TP4/EP4/dp1, or world=4 TP2/DP2/EP4). The folded group tables already lay EP groups across the tp-lane-major rank ordering, so only the validation gate and the routed-expert gradient reduction needed to change. Fix routed-expert gradient over-scaling under folding. The folded forward all-gathers expert outputs into a replicated full view in restore_combined, whose backward injects a tp_size factor (the same factor the replicated router cancels via AVERAGE). Routed experts were classified SKIP, so that factor survived and the expert weight gradient reached the optimizer tp_size times too large. This was invisible to scale-invariant Adam but real for non-adaptive optimizers and for gradient clipping (it inflates the expert contribution to the global grad norm), and it grows with tp_size in cross-lane shapes. Routed experts now use a dedicated EXPERT_TP_CANCEL reduction that divides the gradient by tp_size with no TP all_reduce; the division is linear and composes with the existing expert-data-parallel all_reduce in any order. The per-family gradient convention is otherwise unchanged and is now justified for the general cross-lane layout: each family's reduction is keyed to its replication structure, not the EP layout. Router/gate and dense/LayerNorm AVERAGE over the TP (token-replication) group; routed experts cancel the restore tp_size factor and reduce data-parallel over the EDP group. Add CPU/Gloo parity coverage for cross-lane shapes (router/gate and LayerNorm vs a non-folded baseline at world=4 TP2/EP4 and TP4/EP4/dp1) and SGD-based expert-weight parity (post-step weight equality vs the non-folded baseline) for the MVP shape and cross-lane shapes including edp>1. Adam is scale-invariant, so the expert test uses SGD to actually exercise the gradient magnitude. Update the folding config/group-table tests: cross-lane shapes are now accepted, and the cross-lane EP/EDP rank tables are asserted. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

PKUWZP · 2026-06-24T08:16:13Z

@tohtana Thanks for the extensive amount of work, super appreciated. A couple of questions (I will also run some tests on my end):

Unbucketed per-param TP all_reduce. _reduce_autoep_folding_tp_replicated_gradients walks
module.named_parameters() and issues an individual all_reduce per replicated param at each grad-accum boundary(ZeRO-0/1 path). It might be fine for the current PR stage with small replicated set, but worth a note.
We might want to run tests on multi-node settings (e.g. 2 node = 16 H100/200) to benchmark the performance.

tohtana requested review from GuanhuaWang, hwchen2017, loadams and tjruwase as code owners June 13, 2026 17:43

chatgpt-codex-connector Bot reviewed Jun 13, 2026

View reviewed changes

tohtana added 2 commits June 13, 2026 11:44

Fix folded TP gradient reductions

0d44a1d

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Normalize folded TP ZeRO gradients

8b1c042

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

PKUWZP self-requested a review June 14, 2026 02:16

delock reviewed Jun 18, 2026

View reviewed changes

delock reviewed Jun 19, 2026

View reviewed changes

tohtana added 3 commits June 23, 2026 01:03

Fix AutoEP folded gradient strategy

7246b14

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add AutoEP + AutoTP parallel folding#8064

Add AutoEP + AutoTP parallel folding#8064
tohtana wants to merge 6 commits into
deepspeedai:masterfrom
tohtana:tohtana/autoep-autotp-parallel-folding-design

tohtana commented Jun 13, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Uh oh!

delock Jun 18, 2026

Uh oh!

tohtana Jun 24, 2026

Uh oh!

delock Jun 19, 2026 •

edited

Loading

Uh oh!

delock Jun 19, 2026

Uh oh!

tohtana Jun 24, 2026

Uh oh!

PKUWZP commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		grad_reduc = self.get_gradient_for_reduction(param)
		self._maybe_reduce_autoep_folding_tp_gradient(param, grad_reduc)

Uh oh!

Conversation

tohtana commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Design

Configuration

Cross-lane expert parallelism

What's included

Correctness & validation

Scope / follow-ups

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

delock Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

tohtana Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

delock Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

delock Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

tohtana Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

PKUWZP commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tohtana commented Jun 13, 2026 •

edited

Loading

delock Jun 19, 2026 •

edited

Loading