🤖 AI Summary
This work addresses the imbalance in expert utilization commonly observed in Mixture-of-Experts (MoE) models. Existing load-balancing methods are largely heuristic and rely on noisy mini-batch assignment statistics, which biases them relative to population-level objectives. The authors propose φ-balancing, a framework that targets population-level expert balance directly by minimizing a strictly convex, symmetric, and differentiable potential function of the expected routing distribution. Using convex duality, they derive an equivalent min-max formulation and solve it with an online mirror descent algorithm, which yields an efficient routing adjustment based on exponential moving averages (EMA) with negligible overhead. Across large-scale pretraining and downstream fine-tuning, φ-balancing consistently outperforms Switch-style and loss-free baselines, with more stable and effective expert utilization.
📝 Abstract
Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $φ$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $φ$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
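To make the idea of an EMA-based routing adjustment concrete, the following is a minimal toy sketch and not the paper's algorithm. Everything specific in it is an assumption for illustration: the potential is taken to be negative entropy, routing is top-1 over bias-adjusted logits, the router is a synthetic one that favors expert 0, and the names `neg_entropy_grad`, `route_top1`, `simulate`, `ema_decay`, `step_size`, and `skew` are hypothetical. It tracks an EMA of per-expert load and moves a per-expert bias against the gradient of the potential evaluated at that EMA.

```python
import numpy as np


def neg_entropy_grad(p, eps=1e-9):
    """Gradient of phi(p) = sum_i p_i * log(p_i), an assumed example potential."""
    return np.log(p + eps) + 1.0


def route_top1(logits, bias):
    """Assign each token to the expert with the highest bias-adjusted score."""
    return np.argmax(logits + bias, axis=1)


def simulate(num_experts=4, batch=64, steps=400, ema_decay=0.8,
             step_size=0.05, skew=2.0, seed=0):
    rng = np.random.default_rng(seed)

    # Hypothetical skewed router: expert 0 receives a constant logit boost.
    offset = np.zeros(num_experts)
    offset[0] = skew

    load_ema = np.full(num_experts, 1.0 / num_experts)  # smoothed load estimate
    bias = np.zeros(num_experts)                        # per-expert routing bias
    history = []

    for _ in range(steps):
        logits = rng.normal(size=(batch, num_experts)) + offset
        choice = route_top1(logits, bias)

        # Fraction of tokens in this mini-batch sent to each expert (noisy).
        batch_load = np.bincount(choice, minlength=num_experts) / batch

        # The EMA smooths the mini-batch noise before it drives the update.
        load_ema = ema_decay * load_ema + (1.0 - ema_decay) * batch_load

        # Lower the bias of experts whose smoothed load is above average.
        # Centering is harmless: a constant shift does not change the argmax.
        grad = neg_entropy_grad(load_ema)
        bias -= step_size * (grad - grad.mean())

        history.append(batch_load)

    return np.array(history), bias


if __name__ == "__main__":
    history, bias = simulate()
    print("first-step load:", history[0])
    print("mean load over last 100 steps:", history[-100:].mean(axis=0))
    print("learned bias:", bias)
```

In this toy setting the bias on the favored expert drifts downward until the smoothed loads are roughly equal. The update depends only on the EMA of observed loads, so it adds a few vector operations per step, which is the kind of low overhead the abstract describes. The paper's actual potential, dual update, and routing rule may differ from these assumptions.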