$φ$-Balancing for Mixture-of-Experts Training

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses imbalanced expert utilization in Mixture-of-Experts (MoE) models. Existing load-balancing methods are largely heuristic and rely on noisy mini-batch assignment statistics, which biases them relative to population-level objectives. The authors propose $φ$-balancing, a framework that targets population-level expert balance directly by minimizing a strictly convex, symmetric, and differentiable potential function of the expected routing distribution. Using convex duality, they derive an equivalent min-max formulation and solve it online with mirror descent, which yields a routing adjustment based on exponential moving averages (EMA) with negligible overhead. In large-scale pretraining and downstream fine-tuning, $φ$-balancing consistently outperforms Switch-style and loss-free baselines, with more stable and effective expert utilization.
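To make the idea of an EMA-driven routing adjustment concrete, here is a toy sketch in NumPy. It is not the authors' algorithm: the negative-entropy potential, the additive per-expert bias, the step size `eta`, the decay `beta`, and the synthetic skewed router are all assumptions chosen for illustration. The sketch tracks an EMA of the batch-averaged routing distribution and moves each expert's bias against the gradient of the potential evaluated at that EMA.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grad_neg_entropy(p, eps=1e-9):
    # Gradient of the (assumed) potential phi(p) = sum_i p_i * log(p_i).
    return np.log(p + eps) + 1.0

def run_balancing(num_experts=4, steps=500, batch=32, beta=0.9, eta=0.1, seed=0):
    # Hypothetical toy setup: a fixed logit skew makes expert 0 the favorite.
    rng = np.random.default_rng(seed)
    skew = np.linspace(2.0, 0.0, num_experts)
    ema = np.full(num_experts, 1.0 / num_experts)  # EMA of the routing distribution
    bias = np.zeros(num_experts)                   # per-expert routing adjustment
    for _ in range(steps):
        logits = rng.normal(size=(batch, num_experts)) + skew
        probs = softmax(logits + bias)
        batch_mean = probs.mean(axis=0)            # noisy mini-batch estimate
        ema = beta * ema + (1.0 - beta) * batch_mean
        # Lower the bias of experts whose tracked load is high, raise the rest.
        bias -= eta * grad_neg_entropy(ema)
        bias -= bias.mean()                        # only relative biases matter
    return ema, bias

if __name__ == "__main__":
    ema, bias = run_balancing()
    print("tracked routing distribution:", np.round(ema, 3))
    print("learned biases:", np.round(bias, 3))
```

In this toy setting the tracked distribution drifts toward uniform and the favored expert ends up with the lowest bias. Other strictly convex, symmetric potentials could be swapped in by replacing `grad_neg_entropy`.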

📝 Abstract

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $φ$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $φ$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
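The step from a convex potential to a min-max problem can be sketched with standard Fenchel duality. The notation below is assumed for illustration and is not taken from the paper: $r_\theta(x)$ is the router's distribution over $E$ experts for input $x$, $\bar{p}(\theta)$ is its expectation, and $\phi^*$ is the convex conjugate of $\phi$.

$$
\bar{p}(\theta) = \mathbb{E}_x\big[r_\theta(x)\big], \qquad
\min_\theta \; \phi\big(\bar{p}(\theta)\big)
$$

$$
\phi(p) = \sup_{\lambda} \; \langle \lambda, p \rangle - \phi^*(\lambda), \qquad
\phi^*(\lambda) = \sup_{q} \; \langle \lambda, q \rangle - \phi(q)
$$

$$
\min_\theta \; \sup_{\lambda} \; \Big\langle \lambda, \, \mathbb{E}_x\big[r_\theta(x)\big] \Big\rangle - \phi^*(\lambda)
$$

For a fixed $\lambda$, the inner objective is linear in the expectation, so a mini-batch average estimates it without bias, whereas plugging a mini-batch average directly into a nonlinear $\phi$ does not. When $\phi$ is differentiable at $p$, the supremum is attained at $\lambda = \nabla \phi(p)$. One example that satisfies the stated conditions on the interior of the simplex is negative entropy, $\phi(p) = \sum_i p_i \log p_i$, whose unique minimizer is the uniform distribution.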

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

load balancing

expert utilization

population-level balance

routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

load balancing

population-level optimization