Skip to content

Add AutoEP + AutoTP parallel folding#8064

Open
tohtana wants to merge 6 commits into
deepspeedai:masterfrom
tohtana:tohtana/autoep-autotp-parallel-folding-design
Open

Add AutoEP + AutoTP parallel folding#8064
tohtana wants to merge 6 commits into
deepspeedai:masterfrom
tohtana:tohtana/autoep-autotp-parallel-folding-design

Conversation

@tohtana

@tohtana tohtana commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

This PR adds parallel folding for AutoEP: tensor parallelism (AutoTP) for the dense/attention path can now coexist with expert parallelism (AutoEP) for the routed-expert path on the same set of ranks, without forcing EP to be a subset of DP — an EP group may span TP lanes and dense-DP ranks (cross-lane EP).
(This PR should be adjusted for ZeRO3 support after #8060 is merged)

Design

Attention/dense and MoE are treated as two independent partitionings of the same rank set, parameterized per parameter family:

  • Dense / attention / shared-expert params: stage_size = tp * dp
  • Routed-expert params: stage_size = ep * etp * edp

dp and edp are always derived, never user-configured, so the invariant tp * dp == ep * etp * edp == stage_size cannot be broken from config. The only structural requirement is that the expert width tiles the stage (stage_size % (ep * etp) == 0); EP groups are then laid across a TP-lane-major rank ordering, so they may span TP lanes and dense-DP ranks.

Configuration

No new config section. Folding is expressed by the coexistence of the existing tensor_parallel and expert_parallel sections:

{
  "tensor_parallel":  { "autotp_size": 4 },
  "expert_parallel":  { "enabled": true, "autoep_size": 4,
                        "expert_tensor_parallel_size": 1 }
}

expert_tensor_parallel_size is carried as a config field but currently must be 1 (expert-internal TP is reserved as follow-up and rejected fail-fast). Validation enforces stage divisibility, TP/sequence-parallel exclusivity, and preset_model consistency between the two sections.

Cross-lane expert parallelism

Expert parallelism no longer has to be a subset of data parallelism. Shapes where the expert width exceeds (or does not divide) the dense data-parallel size are supported, for example:

  • world=4, TP=4, EP=4 (dp=1): the EP group is the whole TP group — one expert per rank.
  • world=4, TP=2, EP=4 (dp=2): the EP group spans both TP lanes and both DP ranks.
  • world=8, TP=4, EP=4 (dp=2, edp=2): EP groups span TP lanes with expert replication.

The per-family gradient convention is keyed to each parameter's replication structure, not to the EP layout, so it holds across the whole tp*dp pool:

  • Router/gate and dense/LayerNorm are AVERAGE over the TP (token-replication) group. The folded router runs redundantly on every TP peer; its partitioned work is reconstructed into a replicated full view by restore_combined, whose all-gather backward injects a tp_size factor that AVERAGE divides out.
  • Routed experts cancel that same restore_combined tp_size factor (divide by tp_size, no TP all-reduce) and reduce data-parallel over the expert-data-parallel (EDP) group. Without the cancellation, folded expert gradients are over-scaled by tp_size — invisible to scale-invariant Adam, but real for non-adaptive optimizers and for gradient clipping (it inflates the expert contribution to the global grad norm). This is now fixed for all folded shapes (the MVP TP2×EP4 shape included).

What's included

  • Folded process-group derivation using the generalized expert/data-parallel group creation (mp_mode TP-strided vs SP-consecutive ordering), including cross-lane EP group tables.
  • Route-full / partition-dispatch path for folded MoE (deepspeed/moe/ep_tp_dispatch.py), with AutoTP skipping AutoEP subtrees.
  • Per-family folded gradient reduction: AVERAGE for replicated router/gate and dense/LayerNorm; a dedicated tp_size cancellation for routed experts; SKIP for genuinely TP-sharded params; SUM contracts reserved for a future true sequence-parallel path.
  • Per-parameter-family ZeRO checkpoint metadata (routed-expert vs dense/router/shared placement) and folded ZeRO-1/2 optimizer-state handling.

Correctness & validation

  • Router/gate and LayerNorm gradient parity against a non-folded ZeRO baseline (atol=1e-1, rtol=5e-3, fp32), on TP2×EP4 (8-rank) and the cross-lane shapes TP2×EP4 (4-rank, ep>dp) and TP4×EP4 (4-rank, dp=1); scale 1.0.
  • Routed-expert weight parity against a non-folded baseline, verified with SGD (Adam is scale-invariant and would mask a uniform gradient-scale error), for the MVP TP2×EP4 shape and cross-lane TP4×EP4 with edp=1 and edp=2.
  • New folding unit tests for config, cross-lane group layout, dispatch, runtime, gradient parity, and checkpoint save/load (multi-rank GPU cases gated for GPU runners; CPU/Gloo parity runs on CI).
  • Real-H100 confirmation (8×H100): router/gate, LayerNorm, and routed-expert gradient parity to the non-folded baseline hold (scale 1.0) for MVP TP2×EP4 and cross-lane TP4×EP4 (edp=1 and edp=2); cross-lane folded training runs with finite loss and finite, non-zero expert/router gradients.
  • Passes the full unit test suite (aws-torch-latest-full) on H100 GPUs.

Scope / follow-ups

  • This PR covers AutoEP + AutoTP folding, including cross-lane EP with etp=1. The replicated-grad reduction is mode-aware so the sequence-parallel (Ulysses) folding case fits the same contract; AutoTP + AutoEP is the validated path here.
  • Expert-internal tensor parallelism (expert_tensor_parallel_size > 1) is reserved for a follow-up.
  • ZeRO-3 composition with folding is planned as separate follow-up work (after Support AutoEP with ZeRO-3 zero.Init source modules #8060 is merged).

Allow tensor parallelism (AutoTP) for the dense/attention path to coexist
with expert parallelism (AutoEP) for routed experts on the same rank set,
without requiring EP to be a subset of DP.

- Treat dense and MoE as independent partitionings: dense view tp*dp,
  expert view ep*etp*edp, with dp/edp derived so tp*dp == ep*etp*edp ==
  stage_size. expert_tensor_parallel_size is reserved (must currently be 1).
- Express folding via the existing tensor_parallel/expert_parallel config
  sections, with divisibility, TP/sequence-parallel exclusivity, and
  preset_model consistency validation.
- Add the route-full / partition-dispatch MoE path and AutoTP skipping of
  AutoEP subtrees; derive folded process groups via the generalized
  expert/data-parallel group creation.
- Reduce TP-replicated router/gate gradients mode-aware (sum when tokens are
  partitioned, average when replicated); record per-parameter-family ZeRO
  checkpoint metadata and handle folded ZeRO-1/2 optimizer state.
- Add folding unit tests (config, groups, dispatch, runtime, gradient parity,
  checkpoint), including multi-rank GPU-gated cases.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 278c919489

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread deepspeed/moe/ep_tp_dispatch.py Outdated
Comment on lines +216 to +220
chunks = torch.split(grad_output, ctx.counts, dim=0)
grad_padded = grad_output.new_zeros((ctx.max_rows, *grad_output.shape[1:]))
if local_count:
grad_padded[:local_count].copy_(chunks[ctx.group_rank])
return grad_padded[:local_count].contiguous(), None, None, None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Sum gathered-row gradients across TP lanes

When folded MoE output is consumed differently on each TP lane (for example by a row-parallel/lm-head layer that slices the hidden dimension), every gathered row participates in the loss on every lane. This backward path only returns chunks[ctx.group_rank] from the local rank's grad_output, so contributions from peer lanes to this rank's local expert outputs and routing weights are dropped; the padded local gradient needs to be accumulated across ctx.group before returning.

Useful? React with 👍 / 👎.

Comment thread deepspeed/runtime/zero/stage_1_and_2.py Outdated
Comment on lines +1120 to +1121
grad_reduc = self.get_gradient_for_reduction(param)
self._maybe_reduce_autoep_folding_tp_gradient(param, grad_reduc)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor ds_grad_is_ready before TP reduction

In ZeRO-2 folded runs, parameters with ds_grad_is_ready=False are intentionally skipped until their transient/tiled gradient is complete, as the guard immediately below documents. Calling the new TP reduction before that guard mutates and all-reduces incomplete gradients for those parameters, which can corrupt the final accumulated gradient once the ready shard is eventually reduced.

Useful? React with 👍 / 👎.

tohtana added 2 commits June 13, 2026 11:44
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@PKUWZP PKUWZP self-requested a review June 14, 2026 02:16
"is planned as follow-up work.")

expert_width = spec.ep_size * spec.etp_size
if spec.tp_size > 1 and expert_width > spec.dp_size:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if expert_width % spec.dp_size != 0?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Your comment made me realize this was not only a remainder case, but also that the original version was still enforcing a stricter layout than Parallel Folding actually requires.

I generalized the layout after that: EP no longer has to be nested inside dense DP, so the previous checks around expert_width > dp_size and dp_size % expert_width != 0 are removed. In particular, expert_width does not need to divide dp_size; EP/EDP groups are derived over the per-stage rank set instead.

The remaining expert-side divisibility requirement is that the expert width tiles the stage:

stage_size % (ep * etp) == 0

If that does not hold, we still reject it fail-fast because edp cannot be derived cleanly. Otherwise dp and edp are derived under:

tp * dp == ep * etp * edp == stage_size

Examples that were previously rejected and now run include world=4 TP4/EP4 (dp=1, edp=1), world=4 TP2/EP4 (dp=2, edp=1), and world=8 TP4/EP4 (dp=2, edp=2).

Comment thread deepspeed/runtime/engine.py Outdated
if not param.requires_grad or param.grad is None:
continue
if is_moe_param(param) or is_model_parallel_parameter(param):
continue

@delock delock Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filter cannot distingush router grads from laynorm grads. The router grads needs SUM because of dispatch of token to experts, but laynorm does not need SUM, so they need different reduce strategy.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the test, I verified on a CPU system that this test case cannot pass.

# Copyright (c) DeepSpeed Team.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
"""Engine-path (zero_stage=0) parity: does folded layernorm/gate match the non-folded
DP baseline in the FULL flow? (The author only tested ZeRO-2 parity; the engine path
that runs at zero_stage=0/1 is untested.) CPU/Gloo, world=8.
"""
import deepspeed
from unit.v1.moe.autoep_test_utils import make_autoep_config, run_cpu_gloo_test, seed_everything
from unit.v1.moe.test_autoep_autotp_grad_parity import (
    _router_grad_model,
    _run_router_grad_boundary,
    _full_grad_by_suffix,
)

GATE_BASELINE = "model.layers.0.mlp.gate.weight"
GATE_FOLDED = "model.layers.0.mlp.router.gate.weight"
LN = "model.layers.0.input_layernorm.weight"


def _baseline_cfg():
    c = {k: v for k, v in make_autoep_config(zero_stage=0, ep_size=1, mixed_precision=False).items()
         if k != "expert_parallel"}
    c["gradient_accumulation_steps"] = 2
    c["gradient_clipping"] = 0.0
    c["communication_data_type"] = "fp32"
    c["optimizer"]["params"]["torch_adam"] = True
    return c


def _folded_cfg():
    c = make_autoep_config(zero_stage=0, ep_size=4, mixed_precision=False)
    c["gradient_accumulation_steps"] = 2
    c["gradient_clipping"] = 0.0
    c["communication_data_type"] = "fp32"
    c["optimizer"]["params"]["torch_adam"] = True
    c["expert_parallel"]["autoep_size"] = 4
    c["tensor_parallel"] = {"autotp_size": 2, "partition_config": {
        "use_default_specs": False, "layer_specs": [{"patterns": [r".*\.weight$"], "partition_type": "skip"}]}}
    return c


def _worker(rank, world_size, tmpdir):
    seed = 1234
    tp_size = 2
    logical_dp_world_size = world_size // tp_size
    logical_dp_rank = rank // tp_size

    seed_everything(seed)
    reference_state = _router_grad_model().state_dict()

    baseline_model = _router_grad_model()
    baseline_model.load_state_dict(reference_state)
    baseline_engine, *_ = deepspeed.initialize(model=baseline_model, config=_baseline_cfg())
    _run_router_grad_boundary(baseline_engine,
                              logical_dp_world_size=logical_dp_world_size,
                              logical_dp_rank=logical_dp_rank,
                              seed=seed)
    base_gate = _full_grad_by_suffix(baseline_engine, GATE_BASELINE)
    base_ln = _full_grad_by_suffix(baseline_engine, LN)

    folded_model = _router_grad_model()
    folded_model.load_state_dict(reference_state)
    folded_engine, *_ = deepspeed.initialize(model=folded_model, config=_folded_cfg())
    _run_router_grad_boundary(folded_engine,
                              logical_dp_world_size=logical_dp_world_size,
                              logical_dp_rank=logical_dp_rank,
                              seed=seed)
    folded_gate = _full_grad_by_suffix(folded_engine, GATE_FOLDED)
    folded_ln = _full_grad_by_suffix(folded_engine, LN)

    gate_ratio = (folded_gate.norm() / base_gate.norm()).item()
    ln_ratio = (folded_ln.norm() / base_ln.norm()).item()
    print(f"[rank {rank}] ENGINE(zero0) gate_ratio={gate_ratio:.4f} ln_ratio={ln_ratio:.4f}")

    if rank == 0:
        assert abs(gate_ratio - 1.0) <= 5e-3, f"gate parity: {gate_ratio}"
        assert abs(ln_ratio - 1.0) <= 5e-3, f"ln parity: {ln_ratio}"


def test_b1_engine_path_parity(tmpdir):
    run_cpu_gloo_test(_worker, tmpdir, world_size=8)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the precise diagnosis! You are right that router/gate and LayerNorm gradients cannot be safely distinguished by that old filter. I changed the reduction logic so it is keyed to each parameter family’s actual replication structure instead.

For the current folded AutoEP+AutoTP path, both router/gate and LayerNorm end up using AVERAGE over the TP token-replication group, but for different reasons. LayerNorm is simply TP-replicated. Router/gate is different: we drop the duplicated TP activations before MoE dispatch, then reconstruct the full TP-replicated token view after MoE with restore_combined; the backward of that reconstruction introduces a tp_size factor, and the TP AVERAGE reduction cancels it. The SUM path is kept for the token-partial / true-SP cases, but it is not the live router/gate path here.

While fixing this, I also found a separate scaling issue for routed-expert weights. Their gradients go through the same restore_combined backward factor, but they had been classified as SKIP, so the tp_size factor reached the optimizer. That made folded expert gradients tp_size too large. The fix now cancels that factor by dividing routed-expert grads by tp_size without adding a TP all-reduce; it composes with the existing EDP all-reduce.

Your zero_stage=0 reproducer now passes on the head of this PR branch. I also added parity coverage against the non-folded baseline for router/gate, LayerNorm, and routed-expert weights across TP2/EP4 and TP4/EP4 folded shapes, including ZeRO 0 and ZeRO 2.

tohtana added 3 commits June 23, 2026 01:03
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Comment-only; no behavior change. Make the folded TP/SP gradient
SUM-vs-AVERAGE policy and its reasoning explicit in the code:

- mark_autoep_folding_* contracts: router_gate_replicated is the only marker
  applied on the live forward path (AVERAGE); routed-token-partial and
  SP-sharded-LayerNorm are future-only SUM contracts pinned by unit tests.
- _AllGatherVariableRows / restore_combined: document the tp_size factor the
  restore all-gather backward injects, which makes the folded router/gate
  arrive replicated (AVERAGE) and why SUM reproduces the 2.0x parity
  regression the CPU/Gloo tests guard.
- partition_assignments: the drop-before-all-to-all keeps EP dispatch at 1x
  volume, and the reconstruction is what makes the gradient replicated.
- Strategy classifier: the router-vs-LayerNorm asymmetry (router can be a
  tp/sp partial because it rides the dispatch all-to-all; LayerNorm only
  under true SP) and the underlying gathered=AVERAGE / sharded=SUM rule.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Lift the temporary fail-fast that forced expert parallelism to be a subset of
data parallelism (expert_width = ep*etp <= dp and dp % expert_width == 0) and
support cross-lane EP, where an EP group may span TP lanes and dense-DP ranks
(e.g. world=4 TP4/EP4/dp1, or world=4 TP2/DP2/EP4). The folded group tables
already lay EP groups across the tp-lane-major rank ordering, so only the
validation gate and the routed-expert gradient reduction needed to change.

Fix routed-expert gradient over-scaling under folding. The folded forward
all-gathers expert outputs into a replicated full view in restore_combined,
whose backward injects a tp_size factor (the same factor the replicated router
cancels via AVERAGE). Routed experts were classified SKIP, so that factor
survived and the expert weight gradient reached the optimizer tp_size times too
large. This was invisible to scale-invariant Adam but real for non-adaptive
optimizers and for gradient clipping (it inflates the expert contribution to the
global grad norm), and it grows with tp_size in cross-lane shapes. Routed
experts now use a dedicated EXPERT_TP_CANCEL reduction that divides the gradient
by tp_size with no TP all_reduce; the division is linear and composes with the
existing expert-data-parallel all_reduce in any order.

The per-family gradient convention is otherwise unchanged and is now justified
for the general cross-lane layout: each family's reduction is keyed to its
replication structure, not the EP layout. Router/gate and dense/LayerNorm
AVERAGE over the TP (token-replication) group; routed experts cancel the restore
tp_size factor and reduce data-parallel over the EDP group.

Add CPU/Gloo parity coverage for cross-lane shapes (router/gate and LayerNorm vs
a non-folded baseline at world=4 TP2/EP4 and TP4/EP4/dp1) and SGD-based
expert-weight parity (post-step weight equality vs the non-folded baseline) for
the MVP shape and cross-lane shapes including edp>1. Adam is scale-invariant, so
the expert test uses SGD to actually exercise the gradient magnitude. Update the
folding config/group-table tests: cross-lane shapes are now accepted, and the
cross-lane EP/EDP rank tables are asserted.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@PKUWZP

PKUWZP commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

@tohtana Thanks for the extensive amount of work, super appreciated. A couple of questions (I will also run some tests on my end):

  • Unbucketed per-param TP all_reduce. _reduce_autoep_folding_tp_replicated_gradients walks
    module.named_parameters() and issues an individual all_reduce per replicated param at each grad-accum boundary(ZeRO-0/1 path). It might be fine for the current PR stage with small replicated set, but worth a note.
  • We might want to run tests on multi-node settings (e.g. 2 node = 16 H100/200) to benchmark the performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants