Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

📅 2026-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing Mixture-of-Experts (MoE) models: under a fixed per-token computational budget, their capacity is constrained by physical depth and width. To overcome this, the authors propose the Mixture of Universal Experts (MoUE) architecture, which reuses a unified expert pool across layers, converting model depth into virtual width while keeping per-token activation counts constant and increasing representational capacity. To mitigate the challenges that arise from expert reuse, namely routing-path explosion and load imbalance, the approach introduces a staggered rotational topology for structured expert sharing, a depth-aware load-balancing scheme, and a universal router with lightweight trajectory state for consistent multi-step routing. Experiments show that MoUE consistently outperforms MoE baselines by up to 1.3% across scales and enables progressive upgrades of existing MoE models, with gains of up to 4.2%.
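To make the expert-reuse idea concrete, the sketch below shows a minimal, hypothetical PyTorch layer that routes tokens into a single expert pool shared by every block. The class names, sizes, and residual wiring are illustrative assumptions; the paper's staggered rotational topology, depth-aware load balancing, and trajectory-state router are deliberately omitted.

```python
import torch
import torch.nn as nn

class UniversalExpertPool(nn.Module):
    """A single pool of feed-forward experts shared by every layer (hypothetical sketch)."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=16):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

class MoUELayer(nn.Module):
    """One block's MoE sublayer: a per-layer router over the shared expert pool."""
    def __init__(self, pool: UniversalExpertPool, d_model=512, top_k=2):
        super().__init__()
        self.pool = pool                                   # reused across all layers
        self.router = nn.Linear(d_model, len(pool.experts))
        self.top_k = top_k                                 # per-token activation stays constant

    def forward(self, x):                                  # x: (num_tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                        # dispatch each selected expert
            for e, expert in enumerate(self.pool.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# Depth becomes "virtual width": all 8 layers route into the same 16 experts.
pool = UniversalExpertPool()
layers = nn.ModuleList([MoUELayer(pool) for _ in range(8)])
x = torch.randn(4, 512)
for layer in layers:
    x = x + layer(x)                                       # residual connection
```

Because every layer draws on the same pool, an L-layer model exposes each token to up to L x top_k expert selections while the per-step activated parameter count stays fixed, which is the sense in which depth is converted into virtual width.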

📝 Abstract
Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), an MoE generalization that introduces a novel scaling dimension: Virtual Width. MoUE reuses a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: routing-path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
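As a rough illustration of why load balancing needs a depth-aware correction when the expert pool is shared, the sketch below computes a standard Switch-Transformer-style auxiliary balance loss per layer and aggregates it over depth. The per-layer 1/(depth+1) weighting is purely a placeholder assumption and is not the paper's Universal Expert Load Balance formulation.

```python
import torch

def switch_balance_loss(router_probs, expert_idx, num_experts):
    """Switch-style auxiliary load-balance loss for one layer.

    router_probs: (num_tokens, num_experts) softmax routing probabilities
    expert_idx:   (num_tokens,) top-1 expert assignment per token
    """
    ones = torch.ones_like(expert_idx, dtype=router_probs.dtype)
    dispatch = torch.zeros(num_experts, dtype=router_probs.dtype).scatter_add_(0, expert_idx, ones)
    dispatch = dispatch / expert_idx.numel()              # fraction of tokens per expert
    importance = router_probs.mean(dim=0)                 # mean routing mass per expert
    return num_experts * torch.sum(dispatch * importance)

# With a shared pool, every layer adds exposure to the same experts, so the
# per-layer losses must be reweighted. The 1/(depth+1) factor is an assumption.
num_layers, num_experts, num_tokens = 8, 16, 256
total_aux = 0.0
for depth in range(num_layers):
    probs = torch.randn(num_tokens, num_experts).softmax(-1)   # stand-in router output
    total_aux += switch_balance_loss(probs, probs.argmax(-1), num_experts) / (depth + 1)
```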
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
scalability
model capacity
depth-width limitation
virtual width
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
Virtual Width
Universal Experts
Depth-Width Transformation
Load Balancing
🔎 Similar Papers
No similar papers found.
Yilong Chen
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Naibin Gu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Junyuan Shang
Baidu NLP
Deep Learning, Natural Language Processing, Healthcare
Zhenyu Zhang
Baidu Inc.
Natural Language Processing, Large Language Model, Multimodal Language Model
Yuchen Feng
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Jiawei Sheng
Institute of Information Engineering, Chinese Academy of Sciences
Knowledge Graph, Recommendation, Natural Language Processing
Tingwen Liu
Institute of Information Engineering, Chinese Academy of Sciences
Content Security, Natural Language Processing, Knowledge Graph
Shuohuan Wang
Baidu
Natural Language Processing, Deep Learning
Yu Sun
Baidu
Natural Language Processing, Deep Learning
Hua Wu
Baidu Inc.
Haifeng Wang
Baidu
NLP, MT, Search, Speech, Data Mining