🤖 AI Summary
This work proposes MoSE, a Mixture-of-Experts (MoE) architecture that introduces slimmable networks into MoE for the first time, allowing each expert to operate at variable widths and enabling fine-grained, dynamic allocation of computation within experts. Conventional MoE models activate and fully execute a fixed set of experts, and therefore offer only discrete trade-offs between accuracy and computational cost; MoSE instead achieves a continuous accuracy–computation trade-off through joint multi-width training, sparse routing, and test-time width adaptation. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or exceeds the performance of standard MoE at full width while significantly advancing the Pareto frontier, attaining equal or higher accuracy with substantially fewer FLOPs.
📝 Abstract
Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected it is executed in full. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy–computation trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for determining expert widths at runtime, including a lightweight test-time training mechanism that learns to map router probabilities (a proxy for routing confidence) to expert widths under a fixed compute budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the accuracy–cost Pareto frontier, achieving comparable performance with significantly fewer FLOPs.
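To make the core idea concrete, here is a minimal NumPy sketch of a slimmable expert: a single MLP whose hidden width can be sliced at inference time, so smaller widths reuse a nested prefix of the same weights, and a toy policy that maps router probability to a width. All names (`expert_forward`, `width_from_router_prob`), the slicing scheme, and the fixed thresholding rule are illustrative assumptions; the paper's actual multi-width training recipe and learned test-time width mapping are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32  # full (maximum) expert width

# Shared expert weights; sub-widths are nested slices of these.
W1 = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

def expert_forward(x, width_frac):
    """Run the expert using only the first `width_frac` of its hidden units.

    Slicing columns of W1 and rows of W2 yields a nested sub-network,
    so every width shares (jointly trained) parameters with the full expert.
    FLOPs scale roughly linearly with `width_frac`.
    """
    h = int(round(d_hidden * width_frac))
    hidden = np.maximum(x @ W1[:, :h], 0.0)  # ReLU on the sliced hidden layer
    return hidden @ W2[:h, :]

def width_from_router_prob(p, widths=(0.25, 0.5, 0.75, 1.0)):
    """Toy width policy: tokens the router is more confident about get
    wider experts. This fixed thresholding is only a stand-in for the
    learned mapping described in the abstract."""
    return widths[min(int(p * len(widths)), len(widths) - 1)]

# A batch of 4 token vectors, each routed with a different confidence.
x = rng.standard_normal((4, d_model))
for p in (0.1, 0.6, 0.95):
    w = width_from_router_prob(p)
    y = expert_forward(x, w)
    assert y.shape == (4, d_model)  # output dimension is width-independent
```

The sketch highlights the key design choice: because narrower widths are prefixes of the full expert, one set of weights serves every point on the accuracy–computation curve, rather than maintaining separate experts per budget.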