HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the substantial cross-GPU communication overhead and severe load imbalance in Mixture-of-Experts (MoE) training caused by skewed token-to-expert assignment, the paper proposes a hierarchical optimization framework. The framework integrates hierarchical token deduplication, dynamic expert swapping, and topology-aware distributed modeling, implemented as an efficient prototype atop Megatron-LM. Its key innovation lies in jointly modeling communication optimization and computational load balancing while explicitly accounting for the real hardware interconnect topology. Experiments on a 32-GPU cluster demonstrate that the approach achieves 1.55–3.32× faster communication and a 1.18–1.27× end-to-end training speedup over state-of-the-art baselines such as Tutel-2DH and SmartMoE. These improvements significantly enhance the scalability of large-scale MoE models in distributed training settings.

📝 Abstract
The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) because its sparsity reduces computational demands while allowing the model size to scale easily. In MoE models, each MoE layer must dynamically route tokens to their activated experts, which may not reside on the same device or GPU as the tokens. This dynamic routing leads to substantial communication overhead and load imbalance across GPUs, which obstructs the scalability of distributed training on a GPU cluster. To this end, we introduce HierMoE, which accelerates the training of MoE models with two topology-aware techniques: 1) token deduplication to reduce communication traffic, and 2) expert swap to balance workloads among all GPUs. To make the two proposed approaches more general, we build theoretical models that derive the best token deduplication and expert swap strategies under different model configurations and hardware environments. We implement our prototype HierMoE system atop Megatron-LM and conduct experiments on a 32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results show that HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training compared to state-of-the-art MoE training systems, Tutel-2DH, SmartMoE, and Megatron-LM.
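The token-deduplication idea can be illustrated with a small sketch: under top-k routing, a token whose selected experts are hosted on the same destination node needs to cross the inter-node link only once rather than once per expert. The function and mappings below are hypothetical, not HierMoE's actual implementation.

```python
from collections import defaultdict

def dedup_send_counts(token_topk_experts, expert_to_node):
    """Count per-node message copies with and without deduplication."""
    naive = defaultdict(int)
    deduped = defaultdict(set)
    for token_id, experts in enumerate(token_topk_experts):
        for e in experts:
            node = expert_to_node[e]
            naive[node] += 1             # one copy per (token, expert) pair
            deduped[node].add(token_id)  # one copy per (token, node) pair
    return dict(naive), {n: len(s) for n, s in deduped.items()}

# Illustrative setup: 4 experts across 2 nodes, top-2 routing for 3 tokens
expert_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
assignments = [(0, 1), (0, 2), (2, 3)]   # token i -> its two chosen experts
naive, deduped = dedup_send_counts(assignments, expert_to_node)
# naive sends 3 copies to each node; deduplication sends only 2
```

In a hierarchical setting, the same idea applies again at the intra-node level, which is why the traffic savings grow with the top-k value and with how many experts share a node.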
Problem

Research questions and friction points this paper is trying to address.

Reduces communication traffic in MoE training
Balances GPU workloads in distributed systems
Improves scalability of MoE model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical token deduplication reduces communication traffic
Expert swap balances GPU workloads efficiently
Theoretical models optimize token deduplication and expert swap strategies
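As an illustration of the expert-swap idea (a minimal greedy sketch, not the paper's model-driven strategy), one can swap an expert between the most- and least-loaded GPUs whenever doing so lowers the peak per-GPU token load. All names below are hypothetical.

```python
from collections import defaultdict

def best_swap(expert_load, expert_to_gpu):
    """Return a pair (hot_expert, cold_expert) whose swap between the
    most- and least-loaded GPUs lowers the peak load, or None."""
    gpu_load = defaultdict(int)
    for e, g in expert_to_gpu.items():
        gpu_load[g] += expert_load[e]
    hot = max(gpu_load, key=gpu_load.get)   # busiest GPU
    cold = min(gpu_load, key=gpu_load.get)  # least busy GPU
    best, best_max = None, gpu_load[hot]
    for e in [x for x, g in expert_to_gpu.items() if g == hot]:
        for f in [x for x, g in expert_to_gpu.items() if g == cold]:
            d = expert_load[e] - expert_load[f]  # load moved off the hot GPU
            new_max = max(gpu_load[hot] - d, gpu_load[cold] + d)
            if new_max < best_max:
                best, best_max = (e, f), new_max
    return best

# Experts 0,1 on GPU 0 (load 9) and 2,3 on GPU 1 (load 3):
# swapping experts 0 and 3 balances both GPUs at load 6.
swap = best_swap({0: 5, 1: 4, 2: 1, 3: 2}, {0: 0, 1: 0, 2: 1, 3: 1})
```

Per the abstract, HierMoE's actual swap decisions come from theoretical cost models that also account for the hardware interconnect topology, so swap cost is weighed against the balancing benefit rather than applied greedily.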
Wenxiang Lin
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Xinglin Pan
Hong Kong University of Science and Technology (Guangzhou)
Parallel and Distributed Computing, Deep Learning
Lin Zhang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Shaohuai Shi
Professor, Harbin Institute of Technology, Shenzhen
Machine Learning Systems, Parallel and Distributed Computing, GPU Computing, Deep Learning
Xuan Wang
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Xiaowen Chu
IEEE Fellow, Professor, Data Science and Analytics, HKUST(GZ)
GPU Computing, Machine Learning Systems, Parallel and Distributed Computing, Wireless Networks