Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

📅 2026-03-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates the optimal allocation of computational resources between expert and attention layers in Mixture-of-Experts (MoE) models. By introducing the expert-to-attention compute ratio \( r \) and incorporating the total compute budget and model sparsity, the study reveals, for the first time, a power-law relationship between the optimal ratio \( r^* \) and both total computation and sparsity, providing an explicit analytical expression. Building on a GPT-style MoE Transformer architecture, the authors combine neural scaling-law analysis, large-scale experiments, and mathematical modeling to propose an extended Chinchilla scaling framework. This framework significantly improves model performance under a fixed compute budget, offering a quantifiable, dynamic optimization principle for the design of efficient large-scale MoE models.

๐Ÿ“ Abstract
This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.
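The paper's explicit formula for $r^*$ is not reproduced in this summary. As an illustration of the claimed functional form, the sketch below assumes a power law $r^*(C, s) = a \cdot C^{\alpha} \cdot s^{\beta}$ in total compute $C$ and sparsity $s$, and shows how such a law could be fit from $(C, s, r^*)$ observations by log-linear least squares. All coefficient values are made up for the demonstration, not the paper's fitted constants.

```python
import numpy as np

def fit_power_law(C, s, r_star):
    """Fit log r* = log a + alpha*log C + beta*log s by least squares.

    Assumes the hypothesized form r*(C, s) = a * C**alpha * s**beta,
    which becomes linear in the parameters after taking logarithms.
    """
    X = np.column_stack([np.ones_like(C), np.log(C), np.log(s)])
    y = np.log(r_star)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    log_a, alpha, beta = coef
    return np.exp(log_a), alpha, beta

# Synthetic check: generate noiseless data from a known power law
# (illustrative exponents, not the paper's) and recover it.
rng = np.random.default_rng(0)
C = 10.0 ** rng.uniform(18, 24, size=64)   # total training FLOPs
s = rng.uniform(0.02, 0.25, size=64)       # model sparsity
r_true = 0.5 * C**0.05 * s**-0.3           # hypothetical ground truth
a, alpha, beta = fit_power_law(C, s, r_true)
print(a, alpha, beta)  # recovers a ≈ 0.5, alpha ≈ 0.05, beta ≈ -0.3
```

On noiseless data the regression recovers the generating coefficients essentially exactly; with real loss measurements one would instead sweep $r$ at each compute budget and fit the law to the empirical optima.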
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts, compute allocation, neural scaling laws, expert-attention ratio, model efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts, neural scaling laws, compute allocation, expert-attention ratio, model sparsity
Junzhuo Li
The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Peijie Jiang
Ant Group
Changxin Tian
Renmin University of China & Ant Group
Jia Liu
Ant Group
Zhiqiang Zhang
Ant Group
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST