Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design

📅 2025-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
MoE models suffer from two key bottlenecks: imbalanced expert activation, which causes idle time and low capacity utilization under model or expert parallelism, and high communication overhead in expert parallelism. This paper proposes Collaboration-Constrained Routing (C2R), which analyzes MoE routing through a higher-order lens of expert collaboration versus specialization: some experts activate broadly alongside many others, while specialized experts co-activate only with a small subset. C2R regulates routing to encourage more specialized expert groups and pairs this with an efficient MoE implementation that exploits that specialization, without sacrificing accuracy. Evaluated on LLaMA-MoE and Qwen-MoE, C2R improves average downstream task performance by 0.51% and 0.33%, respectively, while reducing inter-GPU all2all communication costs; total running time drops by an extra 20%–30% on top of the state-of-the-art MegaBlocks.

📝 Abstract
Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. By employing a gating network to route input tokens, it selectively activates a subset of expert networks to process the corresponding token embeddings. However, in practice, the efficiency of MoE is challenging to achieve for two key reasons: imbalanced expert activation, which leads to substantial idle time during model or expert parallelism and insufficient capacity utilization; and massive communication overhead, induced by numerous expert routing combinations in expert parallelism at the system level. Previous works typically formulate this as a load-imbalance issue, characterized by the gating network favoring certain experts over others, or attribute it to static execution that fails to adapt to the dynamic expert workload at runtime. In this paper, we examine it from a brand-new perspective: a higher-order view and analysis of MoE routing policies in terms of expert collaboration and specialization, where some experts tend to activate broadly with others (collaborative), while others are more likely to activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, leading to increased communication overhead from repeatedly sending tokens to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups and improve expert utilization, and present an efficient implementation of MoE that further leverages expert specialization. We achieve average performance improvements of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE respectively across ten downstream NLP benchmarks, and reduce the all2all communication costs between GPUs, bringing an extra 20%–30% total running time savings on top of the existing SoTA, i.e. MegaBlocks.
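The "collaboration" notion in the abstract can be made concrete by counting how often pairs of experts are selected together for the same token under top-k routing. The sketch below is illustrative only (the function name and data layout are assumptions, not the paper's implementation): a row with many nonzero entries marks a broadly collaborative expert, which is exactly the pattern the paper finds drives extra all2all traffic.

```python
import numpy as np

def coactivation_matrix(topk_assignments, num_experts):
    """Count how often each pair of experts is activated together for the
    same token -- a simple proxy for the paper's 'expert collaboration'."""
    counts = np.zeros((num_experts, num_experts), dtype=np.int64)
    for experts in topk_assignments:      # experts chosen for one token
        for i in experts:
            for j in experts:
                if i != j:                # skip self-pairs
                    counts[i, j] += 1
    return counts

# Toy example: 3 tokens, top-2 routing over 4 experts.
assignments = [(0, 1), (0, 2), (0, 1)]
C = coactivation_matrix(assignments, num_experts=4)
# Expert 0 co-activates with every partner chosen here (collaborative);
# expert 3 never co-activates at all in this toy trace.
print(C[0].sum())  # → 3
print(C[3].sum())  # → 0
```

In a real run one would collect `topk_assignments` from the gating network over a batch; experts whose rows are dense across many accelerators are the ones C2R would push toward a smaller, more specialized partner set.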
Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced expert activation in MoE models
Reduces communication overhead in expert parallelism
Improves expert utilization via a specialized routing strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaboration-Constrained Routing (C2R) strategy
Encourages specialized expert groups
Reduces inter-GPU all2all communication, yielding an extra 20%–30% total running time savings over MegaBlocks
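One hypothetical way to realize a collaboration constraint is to restrict each token's top-k experts to a single expert group, so tokens fan out to fewer accelerators. The sketch below is a guess at the general shape of such a router, not the paper's actual mechanism; the function name, the group-scoring rule (best expert logit per group), and the fixed grouping are all assumptions for illustration.

```python
import numpy as np

def c2r_route(logits, groups, k=2):
    """Pick top-k experts per token, constrained to one expert group.
    Each group is scored by its best member's logit; top-k is then taken
    within the winning group only (a hypothetical form of the constraint)."""
    num_tokens, _ = logits.shape
    choices = np.zeros((num_tokens, k), dtype=np.int64)
    for t in range(num_tokens):
        # Score each group by the max logit among its member experts.
        best_group = max(groups, key=lambda g: logits[t, list(g)].max())
        members = np.array(best_group)
        # Take the k highest-scoring experts inside the winning group.
        order = np.argsort(logits[t, members])[::-1][:k]
        choices[t] = members[order]
    return choices

# Toy example: 2 tokens, 4 experts split into two groups of 2.
logits = np.array([[0.9, 0.1, 0.2, 0.3],
                   [0.0, 0.2, 0.8, 0.7]])
groups = [(0, 1), (2, 3)]
print(c2r_route(logits, groups, k=2))  # → [[0 1] [2 3]]
```

Because each token's experts now live in one group, the group can be co-located on one accelerator, which is the intuition behind the reduced all2all traffic; in unconstrained top-k routing, token 0 above could instead have been split across both groups.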