🤖 AI Summary
To address the growing communication overhead and declining training efficiency of large-scale Mixture-of-Experts (MoE) models on evolving hardware, this paper presents MegaScale-MoE, a production system for communication-efficient MoE training. The method rests on three key ideas: (1) communication-efficient parallelism strategies customized separately for the attention and feed-forward network (FFN) modules of each MoE layer; (2) holistic overlap of communication with computation at both the inter- and intra-operator levels; and (3) communication compression, with communication patterns adjusted to operate at lower precision. Evaluated on 1,440 NVIDIA Hopper GPUs, the system trains a 352B-parameter MoE model at 1.41M tokens/s, a 1.88× speedup over Megatron-LM, demonstrating substantial gains in scalability and practicality for large-scale MoE training.
📝 Abstract
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE has emerged as a promising architecture for scaling large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems suffer degraded training efficiency, a problem exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both the inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression, adjusting communication patterns to operate at lower precision, which further improves training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that, by offering our insights into system design, this work will motivate future research on MoE systems.
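The core idea behind the inter-operator overlap described above can be illustrated with a minimal sketch. This is not MegaScale-MoE's actual implementation: the functions `all_to_all` and `expert_ffn` are hypothetical stand-ins (using `time.sleep` to mimic network latency and GPU compute), and the single-threaded prefetch loop is only a toy model of how dispatching the next micro-batch's tokens can be hidden behind the current micro-batch's expert computation.

```python
# Illustrative sketch of communication-computation overlap in MoE training.
# All names here are hypothetical; real systems overlap NCCL collectives
# with GPU kernels, not Python sleeps.
from concurrent.futures import ThreadPoolExecutor
import time

def all_to_all(tokens):
    """Stand-in for dispatching tokens across expert-parallel ranks."""
    time.sleep(0.05)  # pretend network latency
    return tokens

def expert_ffn(tokens):
    """Stand-in for the expert feed-forward computation."""
    time.sleep(0.05)  # pretend GPU compute time
    return [t * 2 for t in tokens]

micro_batches = [[1, 2], [3, 4], [5, 6]]

# Serial baseline: communicate, then compute, one micro-batch at a time.
start = time.perf_counter()
serial_out = [expert_ffn(all_to_all(mb)) for mb in micro_batches]
serial_time = time.perf_counter() - start

# Overlapped: launch the next micro-batch's dispatch asynchronously
# while the current micro-batch's expert computation runs.
start = time.perf_counter()
overlap_out = []
with ThreadPoolExecutor(max_workers=1) as comm:
    pending = comm.submit(all_to_all, micro_batches[0])
    for i in range(len(micro_batches)):
        ready = pending.result()                      # wait for dispatched tokens
        if i + 1 < len(micro_batches):
            pending = comm.submit(all_to_all, micro_batches[i + 1])
        overlap_out.append(expert_ffn(ready))         # compute while next dispatch runs
overlap_time = time.perf_counter() - start

assert overlap_out == serial_out  # overlap changes timing, not results
print(f"serial: {serial_time:.2f}s, overlapped: {overlap_time:.2f}s")
```

With three micro-batches the serial schedule pays six sequential 0.05s steps, while the overlapped schedule hides all but the first dispatch behind compute, so the overlapped run finishes measurably faster while producing identical outputs.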