MicroMoE: Fine-Grained Load Balancing for Mixture-of-Experts with Token Scheduling

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to expert load imbalance in Mixture-of-Experts (MoE) models, which arises from dynamic token routing, struggle to achieve high accuracy, computational efficiency, and low system overhead simultaneously. Method: We propose MicroEP, a fine-grained load balancing framework that enables optimal token scheduling at the micro-batch level. MicroEP employs a cross-GPU distributed token redistribution mechanism, coupled with customized Expert Parallelism (EP) communication optimizations and a lightweight runtime scheduler, introducing no additional computation or memory overhead. Contribution/Results: Built on MicroEP, the MicroMoE training system achieves near-theoretically-optimal load balancing across experts during multi-GPU training without sacrificing model accuracy. Experiments demonstrate up to a 47.6% improvement in end-to-end training throughput, establishing a scalable, system-level solution for efficient large-scale MoE training.

📝 Abstract
Mixture-of-Experts (MoE) has emerged as a promising approach to scale up deep learning models due to its significant reduction in computational resources. However, the dynamic nature of MoE leads to load imbalance among experts, severely impacting training efficiency. While previous research has attempted to address the load balancing challenge, existing solutions either compromise model accuracy or introduce additional system overhead. As a result, they fail to achieve fine-grained load balancing, which is crucial to optimizing training efficiency. We propose MicroEP, a novel parallelization strategy to achieve fine-grained load balancing in MoE systems. MicroEP is capable of achieving optimal load balancing in every micro-batch through efficient token scheduling across GPUs. Furthermore, we propose MicroMoE, an efficient distributed MoE training system with MicroEP's load balancing capabilities. Our experimental results demonstrate that MicroMoE improves the end-to-end training throughput by up to 47.6% compared with the state-of-the-art system, and almost consistently achieves optimal load balance among GPUs.
Problem

Research questions and friction points this paper is trying to address.

Addressing load imbalance in Mixture-of-Experts training systems
Achieving fine-grained load balancing without compromising model accuracy
Optimizing token scheduling across GPUs to enhance training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained load balancing via token scheduling
MicroEP parallelization strategy for MoE systems
Distributed training system achieving optimal GPU balance
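
The paper itself does not publish its scheduling algorithm in this listing, but the core idea of balancing per-micro-batch expert workloads across GPUs can be illustrated with a standard greedy heuristic. The sketch below is purely hypothetical (the function name, inputs, and the longest-processing-time-first strategy are assumptions, not MicroEP's actual mechanism): given the number of tokens routed to each expert in one micro-batch, it places each expert's workload on the currently least-loaded GPU.

```python
def balance_microbatch(expert_token_counts, num_gpus):
    """Hypothetical greedy scheduler (LPT heuristic), NOT the MicroEP
    algorithm: assign each expert's token workload for one micro-batch
    to the least-loaded GPU so far, approximating balanced GPU loads.

    expert_token_counts: dict mapping expert id -> tokens routed to it
    num_gpus: number of GPUs holding experts
    Returns (assignment, loads): expert -> GPU map and per-GPU token loads.
    """
    loads = [0] * num_gpus
    assignment = {}
    # Longest-processing-time-first: place heaviest experts first,
    # which tightens the worst-case gap to the optimal makespan.
    for expert, count in sorted(expert_token_counts.items(),
                                key=lambda kv: -kv[1]):
        gpu = min(range(num_gpus), key=loads.__getitem__)
        assignment[expert] = gpu
        loads[gpu] += count
    return assignment, loads
```

In a real MoE system the scheduler would also have to account for where expert parameters physically reside and the all-to-all communication cost of redistributing tokens, which is exactly the overhead the paper's EP communication optimizations target.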