LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe expert load imbalance caused by dynamic routing in expert parallel training, where a few overloaded experts become performance bottlenecks. To mitigate this issue, the authors propose LAER-MoE, a novel framework that introduces the Fully Sharded Expert Parallel (FSEP) paradigm, enabling on-demand dynamic rearrangement of expert parameter layouts during training. LAER-MoE further incorporates a load-balancing planner and a fine-grained communication scheduling mechanism to jointly optimize expert placement and token routing. Experiments on an A100 GPU cluster demonstrate that LAER-MoE achieves up to 1.69× speedup over state-of-the-art systems, significantly alleviating load imbalance and improving the training efficiency of Mixture-of-Experts (MoE) models.
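The planner's core decision, which devices should host which experts given observed per-expert token counts, resembles a classic greedy scheduling problem. The sketch below is an illustrative simplification, not the paper's actual algorithm: the function name and the longest-processing-time heuristic are assumptions, and LAER-MoE's real planner additionally decides token routing and when to trigger a re-layout.

```python
import heapq

def plan_expert_placement(expert_loads, num_devices):
    """Greedy longest-processing-time placement: repeatedly assign the
    heaviest unplaced expert to the currently least-loaded device.
    A simplified stand-in for a load-balancing planner."""
    # Min-heap of (accumulated_load, device_id) so the lightest device
    # is always popped first.
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = {}
    # Place the heaviest experts first so they anchor the schedule.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        dev_load, dev = heapq.heappop(heap)
        placement[expert] = dev
        heapq.heappush(heap, (dev_load + load, dev))
    return placement
```

For example, with token loads `{"e0": 90, "e1": 10, "e2": 50, "e3": 50}` on 2 devices, the heuristic pairs the 90-token expert with the 10-token expert on one device and the two 50-token experts on the other, giving both devices a load of 100.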

📝 Abstract
Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts stall the overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert's parameters across all devices and restores a subset of experts at expert granularity via All-to-All communication during training. This allows flexible re-layout of expert parameters during training to improve load balancing. In particular, we perform fine-grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load-balancing planner that formulates expert re-layout strategies and token routing schemes during training. We perform experiments on an A100 cluster, and the results show that our system achieves up to 1.69× speedup over state-of-the-art training systems. Source code available at https://github.com/PKU-DAIR/Hetu-Galvatron/tree/laer-moe.
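The FSEP layout described above, fully sharding each expert's parameters and reassembling whole experts on demand, can be sketched as follows. This is a hypothetical NumPy simplification: the function names are assumptions, sharding is shown along one weight dimension for clarity, and plain concatenation stands in for the All-to-All exchange a real system would perform over NCCL.

```python
import numpy as np

def shard_expert(weight, num_devices):
    """Fully shard one expert's weight matrix across devices along dim 0,
    mimicking FSEP's fully sharded layout (real systems typically shard
    flattened parameter buffers)."""
    return np.array_split(weight, num_devices, axis=0)

def restore_expert(shards):
    """Reassemble a full expert from its shards. Concatenation stands in
    for the All-to-All communication that gathers shards onto whichever
    device the re-layout plan assigns this expert to."""
    return np.concatenate(shards, axis=0)

# One expert's (8 x 4) FFN weight, sharded across 4 devices.
w = np.arange(32, dtype=np.float32).reshape(8, 4)
shards = shard_expert(w, 4)          # each device holds a (2, 4) slice
w_restored = restore_expert(shards)  # overloaded expert rebuilt on demand
assert np.array_equal(w, w_restored)
```

Because every device holds an equal slice of every expert, the re-layout plan can materialize any expert on any device with the same communication volume, which is what makes the dynamic rearrangement cheap enough to run during training.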
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert parallelism
load imbalance
dynamic routing
training bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Expert Parallelism
Load Balancing
Fully Sharded Expert Parallel
All-to-All Communication