X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scalability challenges of training Mixture-of-Experts (MoE) models such as DeepSeek-MoE on non-NVIDIA platforms (high activation memory overhead, all-to-all communication bottlenecks, and limited cross-platform support), this paper proposes X-MoE, an efficient, cross-platform MoE training framework. It introduces: (1) a padding-free sparse scheduling kernel for token dispatch; (2) a redundancy-bypassing dispatch mechanism that eliminates duplicate token transfers; and (3) an MoE block design that combines sequence sharding with hybrid parallelism. According to the authors, this is the first work to achieve scalable, thousand-GPU MoE training on the Frontier supercomputer (AMD GPUs), supporting a 545-billion-parameter model, ten times larger than the largest model trainable with prior methods under the same hardware budget, while maintaining high throughput. All code is publicly released.
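The padding-free scheduling idea in (1) can be illustrated with a minimal sketch. This is hypothetical Python written for this summary, not the actual X-MoE GPU kernel: instead of padding each expert's token batch to a fixed capacity, tokens are permuted into contiguous per-expert segments, and offsets tell each expert exactly which slice to process.

```python
import numpy as np

def padding_free_dispatch(token_ids, expert_assignments, num_experts):
    """Group tokens by routed expert without per-expert capacity padding.

    Hypothetical illustration of the padding-free idea (function name and
    API are assumptions, not X-MoE's interface). Returns a permutation of
    the tokens plus offsets so expert e owns the contiguous slice
    [offsets[e], offsets[e + 1]).
    """
    # Stable sort groups tokens by expert while preserving token order.
    order = np.argsort(expert_assignments, kind="stable")
    counts = np.bincount(expert_assignments, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return token_ids[order], offsets

# Six tokens routed to three experts; no token is padded or dropped.
perm, off = padding_free_dispatch(
    np.arange(6), np.array([2, 0, 1, 0, 2, 0]), num_experts=3
)
# perm -> [1, 3, 5, 2, 0, 4]; off -> [0, 3, 4, 6]
```

A capacity-padded dispatcher would instead allocate `num_experts * capacity` slots and fill unused ones with padding tokens, which is where the activation memory overhead mentioned above comes from.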

📝 Abstract
Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at https://github.com/Supercomputing-System-AI-Lab/X-MoE.
Problem

Research questions and friction points this paper is trying to address.

Scalable training for MoE architectures on HPC platforms
Overcoming activation memory overhead and communication costs
Optimizing MoE training performance on non-NVIDIA hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-platform kernels for efficient MoE training
Redundancy-bypassing dispatch mechanism
Hybrid parallelism with sequence-sharded MoE blocks
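The redundancy-bypassing dispatch above can be sketched in a few lines. This is a toy illustration under an assumed layout (experts numbered consecutively, `experts_per_node` experts per node; names are hypothetical, not X-MoE's API): with large top-k routing, a token often selects several experts on the same node, so sending it once per destination node rather than once per expert removes duplicate all-to-all traffic.

```python
def dedup_destinations(topk_experts, experts_per_node):
    """Collapse each token's top-k expert routes to distinct destination nodes.

    Toy sketch of redundancy-bypassing dispatch (assumed expert-to-node
    layout: expert e lives on node e // experts_per_node). Each token is
    then transferred at most once per node instead of once per expert.
    """
    sends = []
    for routes in topk_experts:
        nodes = sorted({e // experts_per_node for e in routes})
        sends.append(nodes)
    return sends

# Two tokens with top-3 routing, 4 experts per node:
# token 0 hits experts 0, 1, 2 (all on node 0) -> a single send.
print(dedup_destinations([[0, 1, 2], [3, 4, 5]], experts_per_node=4))
# -> [[0], [0, 1]]
```

Fine-grained expert segmentation with large top-k makes such same-node duplicates common, which is why this dedup pays off for DeepSeek-style MoEs.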