Grouter: Decoupling Routing from Representation for Accelerated MoE Training

📅 2026-02-22

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This work addresses the slow convergence and training instability commonly observed in traditional Mixture-of-Experts (MoE) models, which stem from the joint optimization of routing policies and expert weights. To overcome these limitations, the authors propose Grouter, a preemptive routing approach: high-quality routing structures are distilled from a pre-trained MoE model and then fixed, decoupling routing optimization from expert weight updates. Grouter further incorporates expert folding, expert tuning, and structure-prior-guided training optimizations so that it can adapt across diverse model configurations and data distributions. Experimental results show that Grouter improves pre-training data utilization by 4.28× and achieves up to 33.5% higher throughput, improving both the efficiency and the performance of MoE training.

📝 Abstract

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, so the model must train expert weights while simultaneously searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully trained MoE models and serves as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly improves both the speed and the quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency: it boosts pre-training data utilization by 4.28× and achieves up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training.
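
The decoupling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the router is distilled or how expert folding works, so `frozen_router` is a random stand-in for a distilled router, the modulo rule in `fold_experts` is a hypothetical folding scheme, and each "expert" is a single scalar. The only point shown is that routing decisions come from fixed parameters while the training step updates expert weights alone.

```python
import random

def route_top_k(token, router_weights, k):
    """Return the indices of the k experts with the highest router score."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def fold_experts(expert_ids, num_target_experts):
    """Hypothetical folding rule (an assumption, not from the paper):
    map source expert ids onto fewer target experts by modulo,
    keeping first-seen order and dropping duplicates."""
    folded = []
    for e in expert_ids:
        t = e % num_target_experts
        if t not in folded:
            folded.append(t)
    return folded

rng = random.Random(0)
DIM, SRC_EXPERTS, TGT_EXPERTS, TOP_K = 4, 8, 4, 2

# Stand-in for a router distilled from a pre-trained MoE.
# Stored as nested tuples so it cannot be modified during training.
frozen_router = tuple(
    tuple(rng.gauss(0, 1) for _ in range(DIM)) for _ in range(SRC_EXPERTS)
)

# Trainable expert parameters: one scalar per target expert, for brevity.
expert_scale = [1.0] * TGT_EXPERTS

def train_step(token, target, lr=0.1):
    """One gradient step on 0.5 * (pred - target)**2.
    Only expert_scale is updated; the router is never touched."""
    experts = fold_experts(route_top_k(token, frozen_router, TOP_K), TGT_EXPERTS)
    x = sum(token)
    pred = sum(expert_scale[e] * x for e in experts) / len(experts)
    err = pred - target
    for e in experts:
        expert_scale[e] -= lr * err * x / len(experts)
    return experts
```

Because the router is fixed, the same token is dispatched to the same experts at every step, so the expert assignment is known before training begins. That is the kind of structural prior the abstract says can be exploited for throughput optimizations.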

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

routing

training instability

convergence

structural priors

Innovation

Methods, ideas, or system contributions that make the work stand out.

preemptive routing

decoupled optimization

Mixture-of-Experts