AI Summary
This work addresses the lack of convergence guarantees for soft-routing Mixture-of-Experts (MoE) models under joint training of nonlinear routers and experts. We propose a provably correct feature learning framework grounded in a student-teacher paradigm. Methodologically, we model a moderately overparameterized MoE architecture, incorporate dynamic weighted aggregation and soft routing, and design a pruning-augmented fine-tuning strategy with provable convergence. Theoretically, we establish the first global convergence guarantee for the student network under joint training, proving exact recovery of teacher parameters. We further uncover an intrinsic gradient-guided mechanism by which experts shape router learning. Finally, we deliver a practical yet theoretically sound optimization paradigm: pruning preserves performance, and fine-tuning enjoys rigorous convergence guarantees. This work provides the first unified, interpretable theoretical lens into MoE training dynamics.
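For intuition, the sketch below shows what such a soft-routed MoE forward pass and student-teacher setup might look like. The single-unit tanh experts, softmax router, and 2x over-parameterization factor are illustrative assumptions for this sketch, not the paper's exact model.

```python
import numpy as np

# Minimal soft-routed MoE forward pass (illustrative sketch, not the paper's code).
# Each expert is a single nonlinear unit; the router produces softmax weights,
# and the model output is the weighted sum of all expert outputs (soft routing).

rng = np.random.default_rng(0)

d, K = 8, 4                           # input dimension, number of teacher experts
W_router = rng.normal(size=(K, d))    # teacher router parameters
W_expert = rng.normal(size=(K, d))    # one nonlinear expert per row

def soft_moe(x, W_router, W_expert):
    """Soft routing: every expert contributes, weighted by the router's softmax."""
    logits = W_router @ x                       # router pre-activations
    gates = np.exp(logits - logits.max())
    gates = gates / gates.sum()                 # softmax routing weights
    expert_out = np.tanh(W_expert @ x)          # nonlinear expert outputs
    return gates @ expert_out                   # dynamic weighted aggregation

x = rng.normal(size=d)
y_teacher = soft_moe(x, W_router, W_expert)     # teacher network generates labels

# A moderately over-parameterized student with more experts than the teacher,
# whose router and experts would be trained jointly to match the teacher.
K_student = 2 * K
W_router_s = rng.normal(size=(K_student, d)) * 0.1
W_expert_s = rng.normal(size=(K_student, d)) * 0.1
y_student = soft_moe(x, W_router_s, W_expert_s)
print(float(y_teacher), float(y_student))
```

In this picture, the pruning-augmented fine-tuning strategy would remove redundant student experts after the feature learning phase and then continue training the remaining ones.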
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase in which the router's learning is "guided" by the experts, recovering the teacher's parameters. Moreover, we show that post-training pruning effectively eliminates redundant neurons, and that the subsequent fine-tuning process provably converges to a global optimum. To our knowledge, this is the first analysis to offer such insights into the optimization landscape of the MoE architecture.
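To make the setting concrete, one common way to formalize a soft-routed MoE and its student-teacher objective is shown below; the notation (router weights w_k, expert weights v_k, expert activation sigma, teacher parameters theta-star) is assumed for illustration and may differ from the paper's exact definitions.

```latex
% Illustrative formalization (assumed notation, not the paper's exact definitions):
% a soft-routed MoE with K experts and the student-teacher squared-loss objective.
f(x;\theta) \;=\; \sum_{k=1}^{K}
    \frac{\exp\big(\langle w_k, x\rangle\big)}{\sum_{j=1}^{K}\exp\big(\langle w_j, x\rangle\big)}
    \,\sigma\big(\langle v_k, x\rangle\big),
\qquad
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x}\Big[\big(f(x;\theta)-f(x;\theta^{\star})\big)^{2}\Big]
```

Under this reading, the results above concern joint gradient-based minimization of the loss over the student's router and expert parameters, with the student moderately over-parameterized relative to the teacher, followed by pruning of redundant student neurons and a provably convergent fine-tuning stage.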