🤖 AI Summary
Training Mixture-of-Experts (MoE) models such as DeepSeek-MoE on non-NVIDIA platforms faces scalability challenges: high activation memory overhead, all-to-all communication bottlenecks, and limited cross-platform support. To address these, this paper proposes an efficient, cross-platform MoE training framework. The approach introduces: (1) a token-padding-free sparse scheduling kernel; (2) a redundancy-bypassing dispatch mechanism that eliminates redundant data distribution; and (3) an MoE block design that combines sequence sharding with hybrid parallelism. To the authors' knowledge, this is the first work to achieve scalable MoE training at the scale of a thousand GPUs on the Frontier supercomputer (AMD GPUs), supporting a 545-billion-parameter model, ten times larger than the largest model trainable with prior methods, with significantly improved throughput. All code is publicly released.
📝 Abstract
Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems, which are primarily optimized for NVIDIA GPUs, perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs (10x larger than the largest model trainable with existing methods under the same hardware budget) while maintaining high training throughput. The source code of X-MoE is available at https://github.com/Supercomputing-System-AI-Lab/X-MoE.
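To make the "padding-free" idea concrete: conventional MoE dispatch packs tokens into a fixed-capacity `(experts, capacity, hidden)` buffer, wasting memory on padded slots when expert load is uneven. A padding-free scheme instead sorts the flattened (token, expert) pairs by expert id and lets each expert consume a contiguous, variable-length segment. The sketch below is a conceptual NumPy illustration under that assumption, not X-MoE's actual kernels; the toy sizes and the stand-in "expert" computation are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 8, 4   # tokens, hidden dim (toy sizes for illustration)
E, k = 4, 2   # experts, top-k routing

x = rng.standard_normal((T, d))        # token activations
logits = rng.standard_normal((T, E))   # router scores (random stand-in)

# Top-k routing: each token selects its k highest-scoring experts.
topk = np.argsort(-logits, axis=1)[:, :k]          # (T, k) expert ids

# Padding-free dispatch: flatten (token, expert) pairs, then stably sort by
# expert id so each expert's tokens form one contiguous variable-length
# segment -- no fixed-capacity (E, C, d) buffer, no padded slots.
token_ids = np.repeat(np.arange(T), k)             # (T*k,)
expert_ids = topk.reshape(-1)                      # (T*k,)
order = np.argsort(expert_ids, kind="stable")
sorted_tokens = x[token_ids[order]]                # (T*k, d), grouped by expert
counts = np.bincount(expert_ids, minlength=E)      # tokens routed to each expert
offsets = np.concatenate(([0], np.cumsum(counts))) # segment boundaries

# Each expert processes exactly its segment.
y = np.empty_like(sorted_tokens)
for e in range(E):
    seg = sorted_tokens[offsets[e]:offsets[e + 1]]
    y[offsets[e]:offsets[e + 1]] = seg * (e + 1)   # stand-in for expert e's FFN

# Scatter results back to the original (token, expert) pair order,
# ready to be weighted by router probabilities and combined per token.
out = np.empty_like(y)
out[order] = y
```

In a distributed setting, `counts` is exactly what a variable-sized all-to-all exchange needs (how many tokens each rank sends to each expert's rank), which is where the dispatch communication cost the abstract mentions arises.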