Slicing and Dicing: Configuring Optimal Mixtures of Experts

📅 2026-05-12

🤖 AI Summary
This work systematically investigates how key design dimensions of Mixture-of-Experts (MoE) architectures interact (number of experts, expert granularity, heterogeneity, shared experts, and load balancing) through over 2,000 pretraining experiments on models up to 6.6B total parameters. The study finds that the number of experts and their granularity are the dominant factors governing model performance, while the other design choices have comparatively small effects. Increasing total MoE parameters consistently improves performance at every active-parameter budget studied, and the optimal expert size depends only on the number of active parameters, remaining nearly invariant to total parameter count. Dropless routing also yields a consistent performance gain.
📝 Abstract
Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices (expert count, granularity, shared experts, load balancing, token dropping) have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size, and load-balancing mechanisms. First, at every active-parameter scale that we study, performance consistently improves with total MoE parameters, even at extreme active expert parameter ratios like 128. Second, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, other choices such as shared experts, heterogeneous experts, and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, since other choices have minimal effect on final quality.
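To make the abstract's distinction between total and active expert parameters concrete, the sketch below counts both for a single MoE layer. It is an illustration under stated assumptions, not the paper's accounting: the class `MoELayerConfig`, its field names, and the choice of three weight matrices per expert (a gated feed-forward block) are hypothetical, and reading the "ratio" as total expert parameters divided by active expert parameters is also an assumption. The example widths are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerConfig:
    """Hypothetical description of one MoE layer (names are illustrative)."""
    d_model: int                  # width of the residual stream
    d_expert: int                 # hidden width of each routed expert (granularity knob)
    num_experts: int              # total routed experts in the layer
    top_k: int                    # routed experts activated per token
    matrices_per_expert: int = 3  # assumption: gated FFN (up, gate, down projections)

    def params_per_expert(self) -> int:
        # Each projection is a d_model x d_expert matrix; biases are ignored.
        return self.matrices_per_expert * self.d_model * self.d_expert

    def total_expert_params(self) -> int:
        return self.num_experts * self.params_per_expert()

    def active_expert_params(self) -> int:
        # Only the top_k selected experts process a given token.
        return self.top_k * self.params_per_expert()

    def total_to_active_ratio(self) -> float:
        return self.total_expert_params() / self.active_expert_params()


# Two layers with identical total and active expert parameters but different
# granularity: a few wide experts versus many narrow ones.
coarse = MoELayerConfig(d_model=1024, d_expert=4096, num_experts=128, top_k=1)
fine = MoELayerConfig(d_model=1024, d_expert=1024, num_experts=512, top_k=4)

for name, cfg in (("coarse", coarse), ("fine", fine)):
    print(name, cfg.total_expert_params(), cfg.active_expert_params(),
          cfg.total_to_active_ratio())
```

Under these assumptions, both configurations have the same total and active expert parameter counts and the same ratio, so they differ only in granularity. This is the kind of comparison in which expert size can be varied while the active-parameter budget is held fixed.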
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert count
expert granularity
load balancing
shared experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert granularity
systematic ablation study
dropless routing
active parameter scaling