Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Mixture-of-Experts (MoE) models are trained with fixed Top-K routing, so performance degrades sharply when the number of activated experts is changed at inference time. This work proposes Matryoshka MoE (M-MoE), a training framework that samples a variable number of activated experts during training. By combining layer-wise stochastic expert activation with a progressively learned expert ranking, M-MoE builds a hierarchical, coarse-to-fine capability structure over the expert ensemble. As a result, a single M-MoE model achieves near-optimal performance across diverse computational budgets (i.e., varying K), closely matching dedicated models trained separately for each K, while enabling layer-wise elastic allocation of compute. Experiments demonstrate that M-MoE maintains high accuracy across multiple expert configurations, substantially reduces training cost compared to training a separate model per K, and improves inference flexibility and deployment efficiency.

📝 Abstract
Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard training strategy of a fixed Top-K router prevents MoE models from realizing their full potential for elastic inference: when the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
Problem

Research questions and friction points this paper is trying to address.

Fixed Top-K routing causes sharp performance degradation when the number of activated experts is changed at inference time
Standard MoE training imposes no ranking among experts, so there is no coarse-to-fine structure to exploit for elastic compute
Serving multiple compute budgets otherwise requires training a separate specialist model for each K, at high total cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Varies the number of activated experts during training to instill inference-time elasticity
Learns a coarse-to-fine expert ranking: top-ranked experts carry core capabilities, later experts add finer detail
Layer-wise randomization of the expert count, identified as the most effective granularity
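The core training loop implied by these bullets can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `sample_layerwise_k` and `matryoshka_topk_route`, and the gating details (softmax over the selected logits), are assumptions for the sake of a concrete example.

```python
import numpy as np

def sample_layerwise_k(num_layers, k_max, rng):
    """Layer-wise randomization (assumed form): draw an independent
    expert count K in [1, k_max] for every MoE layer at each step."""
    return rng.integers(1, k_max + 1, size=num_layers)

def matryoshka_topk_route(gate_logits, k):
    """Route each token to its top-k experts, with combination weights
    from a softmax over the selected gate logits (a common MoE choice,
    assumed here rather than taken from the paper)."""
    # gate_logits: (num_tokens, num_experts)
    topk_idx = np.argsort(gate_logits, axis=-1)[:, ::-1][:, :k]
    topk_logits = np.take_along_axis(gate_logits, topk_idx, axis=-1)
    shifted = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights = shifted / shifted.sum(axis=-1, keepdims=True)
    return topk_idx, weights

# One hypothetical training step: each layer gets its own sampled K,
# so experts must learn a ranking that degrades gracefully as K shrinks.
rng = np.random.default_rng(0)
ks = sample_layerwise_k(num_layers=4, k_max=8, rng=rng)
for layer, k in enumerate(ks):
    gate_logits = rng.standard_normal((16, 8))  # 16 tokens, 8 experts
    idx, w = matryoshka_topk_route(gate_logits, int(k))
```

Because a prefix of the ranked experts is always what gets activated, the same trained model can later be served with any K (uniform or per-layer) without retraining.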