🤖 AI Summary
To address the growing communication overhead and declining training efficiency of large-scale Mixture-of-Experts (MoE) models on evolving hardware, this paper presents MegaScale-MoE, a production system for communication-efficient MoE training. The method rests on three key ideas: (1) communication-efficient parallelism strategies customized separately for the attention and feed-forward network (FFN) modules of each MoE layer; (2) holistic overlap of communication with computation at both the inter- and intra-operator levels; and (3) communication compression, with communication patterns adjusted to operate at lower precision. Evaluated on 1,440 NVIDIA Hopper GPUs, the system trains a 352B-parameter MoE model at 1.41M tokens/s, a 1.88× speedup over Megatron-LM, demonstrating substantial gains in scalability and practicality for large-scale MoE training.
📝 Abstract
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE has emerged as a promising architecture for scaling large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems suffer degraded training efficiency, a problem exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both the inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression, adjusting communication patterns to operate at lower precision, which further improves training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that, by offering our insights into system design, this work will motivate future research on MoE systems.
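The core idea behind the inter-operator overlap described above can be illustrated with a minimal sketch. This is not MegaScale-MoE's actual implementation: the functions `all_to_all` and `expert_ffn` are hypothetical stand-ins (using `time.sleep` to mimic network latency and GPU compute), and the single-threaded prefetch loop is only a toy model of how dispatching the next micro-batch's tokens can be hidden behind the current micro-batch's expert computation.

```python
# Illustrative sketch of communication-computation overlap in MoE training.
# All names here are hypothetical; real systems overlap NCCL collectives
# with GPU kernels, not Python sleeps.
from concurrent.futures import ThreadPoolExecutor
import time

def all_to_all(tokens):
    """Stand-in for dispatching tokens across expert-parallel ranks."""
    time.sleep(0.05)  # pretend network latency
    return tokens

def expert_ffn(tokens):
    """Stand-in for the expert feed-forward computation."""
    time.sleep(0.05)  # pretend GPU compute time
    return [t * 2 for t in tokens]

micro_batches = [[1, 2], [3, 4], [5, 6]]

# Serial baseline: communicate, then compute, one micro-batch at a time.
start = time.perf_counter()
serial_out = [expert_ffn(all_to_all(mb)) for mb in micro_batches]
serial_time = time.perf_counter() - start

# Overlapped: launch the next micro-batch's dispatch asynchronously
# while the current micro-batch's expert computation runs.
start = time.perf_counter()
overlap_out = []
with ThreadPoolExecutor(max_workers=1) as comm:
    pending = comm.submit(all_to_all, micro_batches[0])
    for i in range(len(micro_batches)):
        ready = pending.result()                      # wait for dispatched tokens
        if i + 1 < len(micro_batches):
            pending = comm.submit(all_to_all, micro_batches[i + 1])
        overlap_out.append(expert_ffn(ready))         # compute while next dispatch runs
overlap_time = time.perf_counter() - start

assert overlap_out == serial_out  # overlap changes timing, not results
print(f"serial: {serial_time:.2f}s, overlapped: {overlap_time:.2f}s")
```

With three micro-batches the serial schedule pays six sequential 0.05s steps, while the overlapped schedule hides all but the first dispatch behind compute, so the overlapped run finishes measurably faster while producing identical outputs.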