Very long ns-3 simulation runtime for 64-GPU Megatron/MoE workload, looking for recommended acceleration settings #293

@Zyangzzz

Hi SimAI team,

Thank you for releasing SimAI. I am currently using SimAI to simulate communication time for large-model training workloads with custom topology and placement, and I have encountered a very long runtime when using the ns-3 backend. I would like to ask whether there are recommended settings, build options, or simulation modes that I may have missed.

My workload is generated by AICB / SimAI workload generator with the following configuration:

python3 -m workload_generator.SimAI_training_workload_generator \
  --frame Megatron \
  --model_name Mixtral_8x7B_tp8_dp2_cp4_ep4_sp_hier \
  --world_size 64 \
  --tensor_model_parallel_size 8 \
  --context_parallel_size 4 \
  --pipeline_model_parallel 1 \
  --expert_model_parallel_size 4 \
  --global_batch 2 \
  --micro_batch 1 \
  --epoch_num 1 \
  --num_layers 32 \
  --seq_length 4096 \
  --hidden_size 4096 \
  --num_attention_heads 32 \
  --ffn_hidden_size 14336 \
  --vocab_size 32000 \
  --max_position_embeddings 4096 \
  --moe_enable \
  --moe_router_topk 2 \
  --num_experts 8 \
  --moe_grouped_gemm \
  --enable_sequence_parallel \
  --swiglu \
  --use-distributed-optimizer
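
As a quick consistency check on the flags above, the data-parallel degree can be derived from the other parallel sizes. This is a minimal sketch that assumes the usual Megatron-style convention world_size = TP x CP x PP x DP; the helper name `data_parallel_size` is hypothetical and not part of AICB or SimAI, and the expert-parallel size is not checked here.

```python
def data_parallel_size(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Derive DP, assuming world_size = TP * CP * PP * DP (an assumption)."""
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*CP*PP={model_parallel}"
        )
    return world_size // model_parallel


# Values taken from the workload generator command above.
dp = data_parallel_size(world_size=64, tp=8, cp=4, pp=1)
print(f"DP = {dp}")  # prints "DP = 2"
```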

The generated workload covers 64 GPUs with TP=8, DP=2, CP=4, EP=4, PP=1, and 32 layers. It includes TP/CP collectives, EP all-to-all, and large DP/DP-MoE optimizer or gradient collectives. The current version accepts command-line arguments for CP but does not implement any CP functionality behind them, so I implemented the CP-related workload generation, collective groups, and communication components myself.

I run the ns-3 simulator with:

./bin/SimAI_simulator \
  -t 48 \
  -w <generated_workload.txt> \
  -n <64gpu_topology.txt> \
  -c <network.conf>

The machine has 96 logical CPUs:

2 sockets, 24 cores/socket, 2 threads/core
Intel Xeon Gold 5318Y

The current build appears to enable NS3_MTP but not PHY_MTP. I confirmed that the compile commands contain -DNS3_MTP, and I pass -t 48 to the simulator.

The simulation progresses correctly, but it is extremely slow. In one run, the forward phase and most backward input-gradient collectives completed, but the final large weight-gradient collectives were still very time-consuming. For example, the tail of the workload includes collectives such as:

moe_grad_norm2: REDUCESCATTER, size ~= 4.30 GB
moe_grad_norm1: ALLGATHER_DP_EP, size ~= 2.15 GB
grad_param_comm: REDUCESCATTER, size ~= 1.08 GB
grad_gather: ALLGATHER, size ~= 0.54 GB

These large collectives seem to dominate the runtime. I also noticed that CSV result files are only written after the full workload finishes, so it is difficult to obtain partial results if the final collectives take too long.
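
To give a rough sense of why these collectives could dominate a packet-level run, here is a back-of-the-envelope packet count. It is only a sketch under stated assumptions: the listed sizes are treated as the full buffer in decimal bytes, the collective is modeled as a ring reduce-scatter or all-gather, and `group_size` and `payload_bytes` are hypothetical placeholders rather than values read from my topology or network.conf. Headers, ACKs, and retransmissions are ignored.

```python
# Sizes copied from the workload tail above, interpreted as decimal bytes (assumption).
COLLECTIVE_BYTES = {
    "moe_grad_norm2": 4_300_000_000,
    "moe_grad_norm1": 2_150_000_000,
    "grad_param_comm": 1_080_000_000,
    "grad_gather": 540_000_000,
}


def ring_bytes_per_rank(size_bytes: int, group_size: int) -> int:
    """Bytes each rank sends in a ring reduce-scatter or all-gather."""
    return size_bytes * (group_size - 1) // group_size


def estimate_data_packets(size_bytes: int, group_size: int, payload_bytes: int) -> int:
    """Data packets injected by all ranks of one group (no headers or ACKs)."""
    per_rank = ring_bytes_per_rank(size_bytes, group_size)
    packets_per_rank = -(-per_rank // payload_bytes)  # ceiling division
    return packets_per_rank * group_size


if __name__ == "__main__":
    GROUP_SIZE = 2        # hypothetical: adjust to the real communicator size
    PAYLOAD_BYTES = 1000  # hypothetical: adjust to the configured packet payload
    for name, size in COLLECTIVE_BYTES.items():
        n = estimate_data_packets(size, GROUP_SIZE, PAYLOAD_BYTES)
        print(f"{name}: ~{n:,} data packets per group")
```

Each data packet generates several simulator events per hop, so the event count is a multiple of these figures.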

My questions are:

1. For this type of 64-GPU training workload, is ns-3 packet-level simulation expected to take this long?
2. Are there recommended build options to improve runtime, for example release mode instead of debug mode?
3. Should I enable PHY_MTP, NS3_MTP, or another backend/mode for large training workloads?
4. Is there a recommended way to use SimCCL-style acceleration or analytical collective timing while still preserving topology-aware communication behavior?
5. For paper-scale experiments such as 512/1024 GPUs, which backend and settings were used in your evaluation to keep simulation runtime manageable?
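
To clarify what I mean by analytical collective timing, here is a minimal sketch of a latency-bandwidth (alpha-beta) estimate for a ring collective. It is not SimAI's implementation and ignores topology and congestion, which is exactly the part I would like to preserve. The function name and the link bandwidth, per-step latency, and group size below are all hypothetical.

```python
def ring_collective_time_s(
    size_bytes: float,
    group_size: int,
    link_gbps: float,
    step_latency_us: float,
) -> float:
    """Alpha-beta estimate for a ring reduce-scatter or all-gather.

    The ring takes (group_size - 1) steps; in each step every rank sends
    size_bytes / group_size bytes over a link of link_gbps gigabits per second.
    """
    steps = group_size - 1
    bytes_per_step = size_bytes / group_size
    bandwidth_bytes_per_s = link_gbps * 1e9 / 8
    return steps * (step_latency_us * 1e-6 + bytes_per_step / bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical parameters: 2 ranks, 400 Gb/s links, 5 us per step.
    t = ring_collective_time_s(1.08e9, group_size=2, link_gbps=400, step_latency_us=5)
    print(f"grad_param_comm estimate: {t * 1e3:.3f} ms")
```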

Any guidance on the recommended workflow for large-scale Megatron/MoE communication simulation would be greatly appreciated.

Thanks!
