Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?

📅 2025-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Under strictly equal total parameter count, training FLOPs, and data budget, it remains unclear whether Mixture-of-Experts (MoE) architectures can consistently outperform dense language models. Method: We conduct a systematic architecture search across scales and optimize expert activation rates under fixed resource constraints; introduce a data-reuse strategy to mitigate MoE's reliance on additional training data; and train nearly 250 models at the 2B and 7B scales, cumulatively processing 50 trillion tokens. Contribution/Results: We establish a generalizable, resource-aware optimal MoE design paradigm. For the first time under fully matched resource conditions, we empirically demonstrate stable MoE superiority over dense baselines. We identify a robust optimal expert activation-rate range (~2–4), and this range remains consistent across model sizes. All models are open-sourced. The optimal MoE achieves an average 2.1% reduction in perplexity (PPL) across all configurations, providing both theoretical grounding and practical guidelines for efficient large-model design.

📝 Abstract
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints - that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes performance. Based on this, we subsequently find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute, and data resources. More importantly, this optimal region remains consistent across different model sizes. Although an additional amount of data turns out to be a trade-off for the enhanced performance, we show that this can be resolved via reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All models will be released publicly.
Problem

Research questions and friction points this paper is trying to address.

Comparing MoE and dense LLMs under equal resource constraints
Optimizing MoE architecture for superior performance over dense models
Investigating data trade-offs and solutions in MoE model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal MoE design maximizes performance under constraints
Optimal activation rate region outperforms dense models
Data reuse resolves performance-data trade-off
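The "expert activation rate" the paper optimizes is the fraction of experts that fire per token under top-k routing. The sketch below is a minimal, self-contained illustration of that routing idea only; the toy expert functions, gate vectors, and dimensions are placeholders and not the paper's architecture or training setup.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, experts, gate_weights, k=2):
    """Toy MoE forward pass with top-k routing.

    x            : input vector (list of floats)
    experts      : list of callables, each mapping a vector to a vector
    gate_weights : one gate vector per expert (same length as x)
    k            : number of experts activated per token; the activation
                   rate is k / len(experts)
    Returns the combined output and the indices of the chosen experts.
    """
    # Gating scores: dot product of the input with each expert's gate vector.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_weights]
    probs = softmax(scores)
    # Route to only the k highest-scoring experts (sparse activation).
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the gate probabilities over the selected experts.
    norm = sum(probs[i] for i in topk)
    out = [0.0] * len(x)
    for i in topk:
        y = experts[i](x)
        w = probs[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, topk

# Usage: 4 toy experts (scalar multipliers), activate 2 of them per token,
# i.e. an activation rate of 2/4 = 50% in this toy setting.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [0.0, 0.0]]
out, chosen = moe_layer([1.0, 2.0], experts, gate_weights, k=2)
```

Only the k selected experts run per token, which is why total parameter count can grow without raising per-token compute; the paper's question is whether, at a matched total budget, some value of this rate beats the dense (all-parameters-active) configuration.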