MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

📅 2025-04-21
🤖 AI Summary
To address parallelization bottlenecks that hinder training efficiency for large-scale Mixture-of-Experts (MoE) models across thousands of GPUs, this work proposes a five-dimensional heterogeneous hybrid parallelism framework. It introduces MoE Parallel Folding, a novel mechanism that decouples the parallelization strategies of the attention and MoE layers, allowing each to be configured independently and optimally. A flexible token-level dispatcher supports both token-dropping and dropless routing. The framework unifies tensor, expert, context, data, and pipeline parallelism and, built on Megatron-Core, handles dynamic tensor shapes and coordinates scheduling across parallelism dimensions. Evaluated on H100 clusters, the framework achieves 49.3% MFU for Mixtral 8x22B and 39.0% MFU for Qwen2-57B-A14B, scales efficiently up to 1,024 GPUs, and maintains high performance with sequence lengths up to 128K tokens.

📝 Abstract
Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.
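The core idea of MoE Parallel Folding is that attention layers and MoE layers build their communication groups over the same set of ranks, but sliced along different parallelism dimensions. A minimal sketch of this rank-grouping, under assumed sizes (the helper and group layout below are illustrative, not Megatron-Core's actual API):

```python
# Hypothetical sketch of MoE Parallel Folding: attention and MoE layers form
# communication groups over the SAME ranks, partitioned differently.
# `build_groups` and the chosen sizes are assumptions for exposition.

def build_groups(world_size, inner, outer):
    """Partition ranks into blocks of inner*outer ranks; within each block,
    form `outer` groups of `inner` consecutive ranks."""
    groups = []
    block = inner * outer
    for start in range(0, world_size, block):
        for g in range(outer):
            base = start + g * inner
            groups.append(list(range(base, base + inner)))
    return groups

world_size = 8

# Attention layers: tensor parallel size 4, context parallel size 2.
attn_tp_groups = build_groups(world_size, inner=4, outer=2)

# MoE layers, folded onto the same 8 ranks: expert parallel size 8,
# chosen independently of the attention mapping.
moe_ep_groups = build_groups(world_size, inner=8, outer=1)

print(attn_tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(moe_ep_groups)   # [[0, 1, 2, 3, 4, 5, 6, 7]]
```

Because the two mappings are decoupled, the expert-parallel size no longer has to divide or multiply cleanly against the attention layers' tensor-parallel and context-parallel sizes.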
Problem

Research questions and friction points this paper is trying to address.

Efficient large-scale MoE model training across thousands of GPUs
Optimal parallel configurations for attention and MoE layers
Dynamic token-level dispatching for hybrid parallelism strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Five-dimensional hybrid parallelism for MoE training
MoE Parallel Folding decouples attention and MoE layers
Flexible token-level dispatcher supports dynamic tensor shapes
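The distinction between token-dropping and dropless routing that the dispatcher supports can be sketched as follows. This is an assumed, simplified illustration; the capacity rule and function names do not mirror Megatron-Core's dispatcher:

```python
# Illustrative token-level dispatch: group tokens by their routed expert.
# capacity=None -> dropless: every token is kept.
# capacity=k    -> token-dropping: each expert keeps at most k tokens.
# All names here are hypothetical, for exposition only.

def dispatch(token_expert_ids, num_experts, capacity=None):
    buckets = {e: [] for e in range(num_experts)}
    dropped = []
    for tok, expert in enumerate(token_expert_ids):
        if capacity is None or len(buckets[expert]) < capacity:
            buckets[expert].append(tok)
        else:
            dropped.append(tok)
    return buckets, dropped

routed = [0, 1, 0, 0, 1, 0]  # expert chosen for each of 6 tokens

print(dispatch(routed, num_experts=2))              # dropless: nothing dropped
print(dispatch(routed, num_experts=2, capacity=2))  # overflow tokens dropped
```

In dropless mode the per-expert buckets have dynamic sizes, which is why the framework must support dynamic tensor shapes end to end.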
Authors

Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang (NVIDIA)