CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse Mixture-of-Experts (SMoE) models suffer from suboptimal routing: conventional routing decisions are decoupled from the experts' actual responses, so the experts that perform the computation do not inform which tokens they receive. This paper introduces competition, a routing mechanism that sends each token to the experts with the highest neural response. The authors show theoretically that competition enjoys better sample efficiency than traditional softmax routing. Since evaluating every expert for every token is expensive, the accompanying CompeteSMoE algorithm trains a router to learn the competition policy, retaining strong performance at low training overhead. Empirical evaluations on visual instruction tuning and language pre-training benchmarks demonstrate the efficacy, robustness, and scalability of CompeteSMoE against state-of-the-art SMoE strategies.

📝 Abstract
Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process, where experts that perform computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism to route tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performance at a low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
Problem

Research questions and friction points this paper is trying to address.

Improving suboptimal routing in sparse mixture of experts
Enhancing sample efficiency via competition-based routing
Reducing training overhead while maintaining strong performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Competition mechanism that routes tokens to experts with the highest neural response
CompeteSMoE algorithm for efficient model training
Improved sample efficiency over softmax routing
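The competition idea described above can be sketched in a few lines: instead of gating on a separate router's softmax scores, every expert computes its response and the token is routed to the experts whose responses are strongest. This is a minimal illustrative sketch, not the paper's implementation; the toy experts, the use of the response norm as the competition score, and the `top_k` parameter are simplifying assumptions (in CompeteSMoE itself, a learned router imitates this policy to avoid running all experts at inference).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def make_expert(weight):
    # Toy "expert": scales the token vector by a fixed weight.
    return lambda token: [weight * t for t in token]

experts = [make_expert(w) for w in (0.1, 2.0, 0.5, 1.5)]

def response_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def compete_route(token, experts, top_k=2):
    # Competition: every expert computes its response, and the token is
    # routed to the top_k experts with the largest response norms.
    scores = [response_norm(e(token)) for e in experts]
    ranked = sorted(range(len(experts)), key=lambda i: -scores[i])
    chosen = ranked[:top_k]
    # Gate weights come from a softmax over the winning scores only.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(token)
    for g, i in zip(gates, chosen):
        for d, v in enumerate(experts[i](token)):
            out[d] += g * v
    return chosen, out

chosen, out = compete_route([1.0, -1.0], experts, top_k=2)
```

Here the experts with scale factors 2.0 and 1.5 win the competition, because their responses to the token have the largest norms; a standalone softmax router, by contrast, would score tokens without ever consulting the expert outputs.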