🤖 AI Summary
This work addresses the challenge of translating macroscopic scaling laws for large language models into actionable Mixture-of-Experts (MoE) architecture configurations within a combinatorially explosive design space. The authors propose a reusable two-stage optimization framework that jointly constrains FLOPs per token, activated parameters, and total parameters to efficiently decompose the 16-dimensional search space. They introduce a fairness criterion that accounts for computational density disparities across layer types, and combine algebraic constraint elimination, a rank-preserving property of the hidden dimension, and multivariate scaling law fitting. Validated across hundreds of MoE models spanning six orders of magnitude in compute budget, the framework yields robust scaling laws that map arbitrary computational constraints to a complete, optimal MoE architecture, and reveals that the near-optimal configuration band widens as model scale increases.
📝 Abstract
Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
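The fairness argument above can be made concrete with a toy calculation. The sketch below uses standard back-of-the-envelope approximations (4·d² attention projections, 2·d·d_ff per expert FFN, ~2 FLOPs per activated parameter per token), not the paper's exact accounting, and all field names are illustrative. It shows two MoE configurations with identical FLOPs per token and identical activated parameters but very different total parameters, which is why a single FLOPs-per-token budget cannot fairly compare them and a joint constraint triad is needed:

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    # Hypothetical architecture fields; names are illustrative,
    # not the paper's actual 16-dimensional parameterization.
    n_layers: int
    d_model: int
    d_ff: int        # per-expert FFN hidden size
    n_experts: int   # total routed experts
    top_k: int       # experts activated per token

def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Rough (total, activated) parameter counts: attention + expert FFNs."""
    attn = 4 * cfg.d_model ** 2                 # Q, K, V, O projections
    expert = 2 * cfg.d_model * cfg.d_ff         # up + down projections
    total = cfg.n_layers * (attn + cfg.n_experts * expert)
    active = cfg.n_layers * (attn + cfg.top_k * expert)
    return total, active

def flops_per_token(cfg: MoEConfig) -> int:
    """Common ~2 FLOPs per activated parameter per token approximation."""
    _, active = param_counts(cfg)
    return 2 * active

# Same depth, width, and top-k, but 8x the expert count:
a = MoEConfig(n_layers=24, d_model=2048, d_ff=8192, n_experts=8, top_k=2)
b = MoEConfig(n_layers=24, d_model=2048, d_ff=8192, n_experts=64, top_k=2)

total_a, active_a = param_counts(a)
total_b, active_b = param_counts(b)

# Equal compute per token and equal activated parameters...
assert flops_per_token(a) == flops_per_token(b)
assert active_a == active_b
# ...yet b carries far more total parameters (memory, capacity).
assert total_b > total_a
```

Under a FLOPs-per-token constraint alone, configurations `a` and `b` look identical even though `b` holds roughly 8x the expert parameters; constraining FLOPs per token, activated parameters, and total parameters jointly removes this degeneracy.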