Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevailing practice in Mixture-of-Experts (MoE) language model design, which often relies solely on total and activated parameter counts without a systematic understanding of performance determinants. Through theoretical analysis and empirical validation, the study reveals that MoE performance is jointly governed by total parameter count and expert sparsity, and further demonstrates that the number of experts and the top-k selection value are not interchangeable within sparsity configurations. Building on these insights, the authors propose a new design principle: under fixed memory and inference constraints, one should maximize total parameters while minimizing both the number of experts and sparsity. This leads to a unified framework for MoE architecture optimization that substantially reduces design ambiguity and offers clear guidance for developing efficient MoE models.

📝 Abstract
Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{total}$) and expert sparsity ($s:=n_{exp}/n_{topk}$). Moreover, $n_{exp}$ and $n_{topk}$ do not "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes $N_{total}$ while minimizing $s$ (maximizing $n_{topk}$) and $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
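The abstract's claim that $n_{exp}$ and $n_{topk}$ do not "cancel out" can be illustrated with a rough parameter-accounting sketch. The counting formulas below are a simplification I introduce for illustration (attention plus per-layer expert FFNs; `ffn_mult`, `moe_params`, and `active_params` are hypothetical names, not from the paper), but they show how two configurations with the same sparsity $s$ differ in memory footprint, so matching a memory budget with many experts forces smaller core dimensions:

```python
def moe_params(d_model, n_layers, n_exp, ffn_mult=4):
    """Rough total parameter count: attention + n_exp FFN experts per layer.
    (Simplified: ignores embeddings, router, norms.)"""
    attn = 4 * d_model * d_model                 # Q, K, V, O projections
    expert = 2 * d_model * (ffn_mult * d_model)  # up + down projection
    return n_layers * (attn + n_exp * expert)

def active_params(d_model, n_layers, n_topk, ffn_mult=4):
    """Parameters touched per token: attention + the n_topk routed experts."""
    attn = 4 * d_model * d_model
    expert = 2 * d_model * (ffn_mult * d_model)
    return n_layers * (attn + n_topk * expert)

def sparsity(n_exp, n_topk):
    """Expert sparsity s := n_exp / n_topk, as defined in the abstract."""
    return n_exp / n_topk

# Two hypothetical configs with identical sparsity s = 8:
cfg_a = dict(d_model=2048, n_layers=24, n_exp=16)   # few experts
cfg_b = dict(d_model=2048, n_layers=24, n_exp=64)   # many experts

assert sparsity(16, 2) == sparsity(64, 8) == 8.0
# At equal core dimensions, the 64-expert model needs far more memory,
# so under a fixed budget its depth/width must shrink -- the penalty
# the abstract attributes to a large n_exp.
assert moe_params(**cfg_b) > moe_params(**cfg_a)
```

Under the paper's principle, one would instead spend the fixed memory budget on a larger $N_{total}$ with fewer experts and a lower $s$ (a larger $n_{topk}$ relative to $n_{exp}$).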
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
language models
architectural design
memory constraints
inference constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert sparsity
model design principle
memory constraint
inference efficiency