🤖 AI Summary
This work addresses a limitation of existing Mixture-of-Experts (MoE) models: under a fixed per-token computational budget, capacity is constrained by physical depth and width. To overcome this, the authors propose the Mixture of Universal Experts (MoUE) architecture, which reuses a unified expert pool across layers, converting model depth into virtual width while keeping activation counts constant and enhancing representational capacity. To mitigate the challenges that arise from expert reuse, namely routing-path explosion and load imbalance, the approach introduces a staggered rotational expert topology, a depth-aware load-balancing objective, and a multi-step consistent routing mechanism with lightweight trajectory state. Experiments demonstrate that MoUE consistently outperforms matched MoE baselines by up to 1.3% across various scales and enables progressive upgrades of existing MoE models, yielding performance gains of up to 4.2%.
📝 Abstract
Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a generalization of MoE that introduces a novel scaling dimension: virtual width. MoUE reuses a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: routing-path explosion from recursive expert reuse, and a mismatch between the expert exposure induced by reuse and conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and establishes virtual width as a new scaling dimension for MoE architectures.
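The central mechanism, reusing one layer-agnostic expert pool at every depth step so that depth multiplies the effective routing width without increasing per-token activations, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the top-1 router, the residual update, and the per-depth index rotation (standing in for the staggered rotational topology) are all simplifying assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, num_layers = 8, 4, 3

# One universal, layer-agnostic pool of expert weights, shared by all depths.
# A conventional MoE would instead hold num_layers * num_experts experts.
expert_pool = [rng.standard_normal((d_model, d_model)) * 0.1
               for _ in range(num_experts)]
router_w = rng.standard_normal((d_model, num_experts)) * 0.1  # shared router


def moue_forward(x):
    """Apply the shared expert pool at every depth step.

    A per-depth rotation of the chosen expert index is a stand-in for the
    structured expert sharing ("staggered rotational topology"): each depth
    sees the same pool under a shifted assignment, so depth contributes
    virtual width while only one expert is activated per token per step.
    """
    for depth in range(num_layers):
        logits = x @ router_w                       # (tokens, num_experts)
        # Top-1 choice, rotated by depth so reuse is staggered across layers.
        choice = (np.argmax(logits, axis=-1) + depth) % num_experts
        out = np.stack([x[i] @ expert_pool[e] for i, e in enumerate(choice)])
        x = x + out                                 # residual update
    return x


tokens = rng.standard_normal((5, d_model))
y = moue_forward(tokens)
print(y.shape)
```

Note the budget accounting this sketch makes concrete: parameters scale with the pool size (`num_experts`), while each token still activates exactly one expert per depth step, so per-token compute is unchanged as `num_layers` grows.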