$\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional Mixture of Experts (MoE) architectures, which suffer from training instability as the number of experts grows and are constrained by discrete, independent expert combinations that hinder scalability and performance. To overcome these challenges, the authors propose ∞-MoE, the first framework to extend MoE to a continuous, infinite expert space. By enabling token-dependent continuous sampling over a subset of parameters within a large feedforward network, ∞-MoE achieves dynamic expert selection and sparse activation. This approach allows flexible adjustment of the effective number of experts during inference while maintaining computational efficiency and significantly enhancing model expressivity. Evaluated on the GPT-2 Small architecture, ∞-MoE with only 129M active parameters nearly matches the performance of GPT-2 Medium (350M parameters) and improves accuracy by up to 2.5% over conventional MoE variants.
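To make the mechanism concrete, below is a minimal, illustrative PyTorch sketch of the idea described above: a single large FFN whose hidden units are sparsely activated per token at continuous, router-sampled positions. The class name `InfiniteMoEFFN`, the sigmoid router, the fixed-width hidden-unit slices, and all hyperparameters are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfiniteMoEFFN(nn.Module):
    """Illustrative sketch (not the paper's exact method): a large FFN whose
    hidden units are sparsely activated per token at continuous positions
    produced by a token-dependent router."""

    def __init__(self, d_model, d_hidden, slice_size, num_samples):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)
        # One continuous value per sampled "expert" position.
        self.router = nn.Linear(d_model, num_samples)
        self.d_hidden = d_hidden
        self.slice_size = slice_size
        self.num_samples = num_samples

    def forward(self, x, num_samples=None):
        # x: (batch, d_model) — one token representation per row, for clarity.
        k = num_samples or self.num_samples
        # Token-dependent continuous positions in [0, 1), one per sampled expert.
        pos = torch.sigmoid(self.router(x))[:, :k]          # (batch, k)
        starts = pos * (self.d_hidden - self.slice_size)     # continuous slice starts
        # Each sample activates a window of slice_size hidden units at its position.
        idx = torch.arange(self.d_hidden, device=x.device)   # (d_hidden,)
        dist = idx[None, None, :] - starts[:, :, None]       # (batch, k, d_hidden)
        window = ((dist >= 0) & (dist < self.slice_size)).float()
        mask = window.amax(dim=1)                             # union of the k slices
        # NOTE: the hard window is not differentiable w.r.t. the sampled positions;
        # training end-to-end would require a soft relaxation.
        h = F.gelu(self.w_in(x)) * mask                       # only selected units contribute
        # A real implementation would gather only the selected weight slices so the
        # unselected parameters cost no compute; the dense mask here is for readability.
        return self.w_out(h)
```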

📝 Abstract
The Mixture of Experts (MoE) architecture selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space; as a result, training each expert effectively becomes difficult as the number of experts increases. To stabilize training while increasing the number of experts, we propose $\infty$-MoE, which selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By treating experts as points in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves performance comparable to a dense GPT-2 Medium with 350M parameters. Adjusting the number of sampled experts at inference time allows for a flexible trade-off between accuracy and speed, with an improvement of up to 2.5% in accuracy over conventional MoE.
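The accuracy/speed trade-off mentioned at the end of the abstract could look something like the following, reusing the hypothetical `InfiniteMoEFFN` sketch from the AI summary above; the dimensions are placeholders (768 matches GPT-2 Small's hidden size, the rest are made up).

```python
import torch

# Hypothetical usage of the InfiniteMoEFFN sketch above: the same weights can be
# queried with more or fewer sampled experts at inference time, without retraining.
layer = InfiniteMoEFFN(d_model=768, d_hidden=4096, slice_size=256, num_samples=8)
tokens = torch.randn(4, 768)  # a small batch of token representations

fast_out = layer(tokens, num_samples=2)  # fewer samples: less compute, lower accuracy
full_out = layer(tokens, num_samples=8)  # more samples: more compute, higher accuracy
```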
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts
infinite experts
continuous space
expert training
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
continuous experts
parameter sampling
infinite experts
efficient inference
Shota Takashiro
The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-8656 Japan
Takeshi Kojima
The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-8656 Japan
Shohei Taniguchi
The University of Tokyo
machine learning
Yusuke Iwasawa
The University of Tokyo
deep learning, transfer learning, foundation models, meta learning
Yutaka Matsuo
Professor, University of Tokyo
deep learning, web mining, artificial intelligence