$\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional Mixture of Experts (MoE) architectures, which suffer from training instability as the number of experts grows and are constrained by discrete, independent expert combinations that hinder scalability and performance. To overcome these challenges, the authors propose ∞-MoE, the first framework to extend MoE to a continuous, infinite expert space. By enabling token-dependent continuous sampling over a subset of parameters within a large feedforward network, ∞-MoE achieves dynamic expert selection and sparse activation. This approach allows flexible adjustment of the effective number of experts during inference while maintaining computational efficiency and significantly enhancing model expressivity. Evaluated on the GPT-2 Small architecture, ∞-MoE with only 129M active parameters nearly matches the performance of GPT-2 Medium (350M parameters) and improves accuracy by up to 2.5% over conventional MoE variants.
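To make the mechanism concrete, below is a minimal, illustrative PyTorch sketch of the idea described above: a single large FFN whose hidden units are sparsely activated per token at continuous, router-sampled positions. The class name `InfiniteMoEFFN`, the sigmoid router, the fixed-width hidden-unit slices, and all hyperparameters are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfiniteMoEFFN(nn.Module):
    """Illustrative sketch (not the paper's exact method): a large FFN whose
    hidden units are sparsely activated per token at continuous positions
    produced by a token-dependent router."""

    def __init__(self, d_model, d_hidden, slice_size, num_samples):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)
        # One continuous value per sampled "expert" position.
        self.router = nn.Linear(d_model, num_samples)
        self.d_hidden = d_hidden
        self.slice_size = slice_size
        self.num_samples = num_samples

    def forward(self, x, num_samples=None):
        # x: (batch, d_model) — one token representation per row, for clarity.
        k = num_samples or self.num_samples
        # Token-dependent continuous positions in [0, 1), one per sampled expert.
        pos = torch.sigmoid(self.router(x))[:, :k]          # (batch, k)
        starts = pos * (self.d_hidden - self.slice_size)     # continuous slice starts
        # Each sample activates a window of slice_size hidden units at its position.
        idx = torch.arange(self.d_hidden, device=x.device)   # (d_hidden,)
        dist = idx[None, None, :] - starts[:, :, None]       # (batch, k, d_hidden)
        window = ((dist >= 0) & (dist < self.slice_size)).float()
        mask = window.amax(dim=1)                             # union of the k slices
        # NOTE: the hard window is not differentiable w.r.t. the sampled positions;
        # training end-to-end would require a soft relaxation.
        h = F.gelu(self.w_in(x)) * mask                       # only selected units contribute
        # A real implementation would gather only the selected weight slices so the
        # unselected parameters cost no compute; the dense mask here is for readability.
        return self.w_out(h)
```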

📝 Abstract
The Mixture of Experts (MoE) architecture selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space; as a result, training each expert effectively becomes difficult as the number of experts increases. To stabilize training while increasing the number of experts, we propose $\infty$-MoE, which selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By treating experts as points in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves performance comparable to a dense GPT-2 Medium with 350M parameters. Adjusting the number of sampled experts at inference time allows for a flexible trade-off between accuracy and speed, with an improvement of up to 2.5% in accuracy over conventional MoE.
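The accuracy/speed trade-off mentioned at the end of the abstract could look something like the following, reusing the hypothetical `InfiniteMoEFFN` sketch from the AI summary above; the dimensions are placeholders (768 matches GPT-2 Small's hidden size, the rest are made up).

```python
import torch

# Hypothetical usage of the InfiniteMoEFFN sketch above: the same weights can be
# queried with more or fewer sampled experts at inference time, without retraining.
layer = InfiniteMoEFFN(d_model=768, d_hidden=4096, slice_size=256, num_samples=8)
tokens = torch.randn(4, 768)  # a small batch of token representations

fast_out = layer(tokens, num_samples=2)  # fewer samples: less compute, lower accuracy
full_out = layer(tokens, num_samples=8)  # more samples: more compute, higher accuracy
```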
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts
infinite experts
continuous space
expert training
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
continuous experts
parameter sampling
infinite experts
efficient inference
Shota Takashiro
The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-8656 Japan
Takeshi Kojima
The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-8656 Japan
Shohei Taniguchi
The University of Tokyo
machine learning
Yusuke Iwasawa
The University of Tokyo
deep learning, transfer learning, foundation models, meta learning
Yutaka Matsuo
Professor, University of Tokyo
deep learning, web mining, artificial intelligence