Very long ns-3 simulation runtime for 64-GPU Megatron/MoE workload, looking for recommended acceleration settings

Hi SimAI team,

Thank you for releasing SimAI. I am currently using SimAI to simulate communication time for large-model training workloads with custom topology and placement, and I have encountered a very long runtime when using the ns-3 backend. I would like to ask whether there are recommended settings, build options, or simulation modes that I may have missed.

My workload is generated by the AICB / SimAI workload generator with the following configuration:
```shell
python3 -m workload_generator.SimAI_training_workload_generator \
  --frame Megatron \
  --model_name Mixtral_8x7B_tp8_dp2_cp4_ep4_sp_hier \
  --world_size 64 \
  --tensor_model_parallel_size 8 \
  --context_parallel_size 4 \
  --pipeline_model_parallel 1 \
  --expert_model_parallel_size 4 \
  --global_batch 2 \
  --micro_batch 1 \
  --epoch_num 1 \
  --num_layers 32 \
  --seq_length 4096 \
  --hidden_size 4096 \
  --num_attention_heads 32 \
  --ffn_hidden_size 14336 \
  --vocab_size 32000 \
  --max_position_embeddings 4096 \
  --moe_enable \
  --moe_router_topk 2 \
  --num_experts 8 \
  --moe_grouped_gemm \
  --enable_sequence_parallel \
  --swiglu \
  --use-distributed-optimizer
``` 
The generated workload has 64 GPUs with TP=8, DP=2, CP=4, EP=4, and PP=1 across 32 layers. It includes TP/CP collectives, EP all-to-all, and large DP/DP-MoE optimizer or gradient collectives. I found that the current version reserves command-line arguments for CP but implements no functionality behind them, so I additionally implemented the CP-related workload generation, collective groups, and communication components.
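
For reference, the parallel layout above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and it assumes that the world size factors as TP x CP x PP x DP and that EP ranks are drawn from the DP x CP ranks, which may not match how a given SimAI or Megatron version forms its groups.

```python
def data_parallel_size(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Return the DP size implied by world_size = TP * CP * PP * DP."""
    model_parallel = tp * cp * pp
    if world_size % model_parallel:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*CP*PP={model_parallel}"
        )
    return world_size // model_parallel


def expert_parallel_fits(dp: int, cp: int, ep: int) -> bool:
    """Check that EP evenly divides DP * CP.

    Assumption: EP groups are carved out of the DP x CP ranks.
    """
    return (dp * cp) % ep == 0


# Values taken from the generator command in this issue.
dp = data_parallel_size(world_size=64, tp=8, cp=4, pp=1)
print(f"DP = {dp}")  # DP = 2
print(f"EP=4 fits in DP*CP={dp * 4}: {expert_parallel_fits(dp, cp=4, ep=4)}")
```

Under these assumptions, EP=4 is only consistent with DP=2 because CP=4 widens the pool of ranks that EP can be drawn from.
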
I run the ns-3 simulator with:
```shell
./bin/SimAI_simulator \
  -t 48 \
  -w <generated_workload.txt> \
  -n <64gpu_topology.txt> \
  -c <network.conf>
``` 
The machine has 96 logical CPUs:
```text
2 sockets, 24 cores/socket, 2 threads/core
Intel Xeon Gold 5318Y
``` 
The current build appears to enable `NS3_MTP` but not `PHY_MTP`. I confirmed that the compile commands contain `-DNS3_MTP`, and I pass `-t 48` to the simulator.
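
A check like this can be scripted against a `compile_commands.json` compilation database. The sketch below counts preprocessor defines and optimization levels per translation unit, which also shows whether the build is optimized. The file path in `load_entries` and the sample entries are hypothetical and are not taken from my actual build.

```python
import json
import re
from collections import Counter


def load_entries(path):
    """Load a compilation database; the path is supplied by the caller."""
    with open(path) as f:
        return json.load(f)


def summarize_flags(entries):
    """Count -D defines and the effective -O level across all entries."""
    defines, opt_levels = Counter(), Counter()
    for entry in entries:
        # Each entry carries either a "command" string or an "arguments" list.
        command = entry.get("command") or " ".join(entry.get("arguments", []))
        for name in set(re.findall(r"(?<!\S)-D(\w+)", command)):
            defines[name] += 1
        # When several -O flags are present, the last one takes effect.
        levels = re.findall(r"(?<!\S)-O(\w*)", command)
        opt_levels[levels[-1] if levels else "none"] += 1
    return defines, opt_levels


# Hypothetical sample entries, for illustration only.
sample = [
    {"file": "a.cc", "command": "g++ -DNS3_MTP -O0 -g -c a.cc"},
    {"file": "b.cc", "arguments": ["g++", "-DNS3_MTP", "-DNDEBUG", "-O3", "-c", "b.cc"]},
]
defines, opt_levels = summarize_flags(sample)
print("NS3_MTP units:", defines["NS3_MTP"])
print("PHY_MTP units:", defines["PHY_MTP"])
print("optimization levels:", dict(opt_levels))
```

On a real build, `summarize_flags(load_entries(...))` would be called with the path to the generated database instead of the inline sample.
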

The simulation progresses correctly, but it is extremely slow. In one run, the forward phase and most backward input-gradient collectives completed, but the final large weight-gradient collectives were still very time-consuming. For example, the tail of the workload includes collectives such as:
```text
moe_grad_norm2: REDUCESCATTER, size ~= 4.30 GB
moe_grad_norm1: ALLGATHER_DP_EP, size ~= 2.15 GB
grad_param_comm: REDUCESCATTER, size ~= 1.08 GB
grad_gather: ALLGATHER, size ~= 0.54 GB
``` 
These large collectives seem to dominate the runtime. I also noticed that CSV result files are only written after the full workload finishes, so it is difficult to obtain partial results if the final collectives take too long.
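
For a rough sense of scale, the sketch below estimates how many packets each rank would inject for the tail collectives if they ran as ring reduce-scatter or all-gather, where each rank sends (n-1)/n of the message. Every parameter here is an assumption rather than a value from my configuration: the group size of 2, the 1000-byte payload, the reading of GB as 10^9 bytes, and the reading of the listed size as the full message per group. The algorithm SimAI actually selects may also differ.

```python
def ring_bytes_per_rank(message_bytes: int, group_size: int) -> int:
    """Bytes one rank sends in a ring reduce-scatter or all-gather."""
    return message_bytes * (group_size - 1) // group_size


def packets_per_rank(message_bytes: int, group_size: int, payload_bytes: int) -> int:
    """Packets one rank sends, rounding the last partial packet up."""
    sent = ring_bytes_per_rank(message_bytes, group_size)
    return -(-sent // payload_bytes)  # integer ceiling division


# Sizes from the workload tail above, read as 10^9-byte GB (assumption).
TAIL_COLLECTIVES = {
    "moe_grad_norm2": 4_300_000_000,
    "moe_grad_norm1": 2_150_000_000,
    "grad_param_comm": 1_080_000_000,
    "grad_gather": 540_000_000,
}
GROUP_SIZE = 2        # assumption: not confirmed for any of these collectives
PAYLOAD_BYTES = 1000  # assumption: depends on the packet size in network.conf

for name, size in TAIL_COLLECTIVES.items():
    count = packets_per_rank(size, GROUP_SIZE, PAYLOAD_BYTES)
    print(f"{name}: ~{count:,} packets per rank")
```

Since a packet-level simulator processes events for every packet at every hop, counts of this order across all participating ranks would be consistent with the long tail I observe, although this is only an estimate and not a measurement.
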

My questions are:

1. For this type of 64-GPU training workload, is ns-3 packet-level simulation expected to take this long?
2. Are there recommended build options to improve runtime, for example release mode instead of debug mode?
3. Should I enable `PHY_MTP`, `NS3_MTP`, or another backend/mode for large training workloads?
4. Is there a recommended way to use SimCCL-style acceleration or analytical collective timing while still preserving topology-aware communication behavior?
5. For paper-scale experiments such as 512/1024 GPUs, which backend and settings were used in your evaluation to keep simulation runtime manageable?

Any guidance on the recommended workflow for large-scale Megatron/MoE communication simulation would be greatly appreciated.

Thanks!

