🤖 AI Summary
Existing work lacks a systematic understanding of how hyperparameters in Mixture-of-Experts (MoE) architectures should scale with network width, expert width, number of experts, sparsity, and depth, which often leads to training instability and unpredictable performance. This work addresses the gap using dynamical mean-field theory (DMFT). It first derives the maximal-update ($\mu$P) parameterization for MoEs and shows that it does not reliably give learning-rate transfer or monotonic improvement with scale. It then introduces a refined set of desiderata, maximal scale stability, and derives the corresponding Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam across three scaling regimes. Experiments show that MSSP robustly recovers learning-rate transfer and monotonic improvement with scale in all three regimes. Combined with existing depth-scaling theory, MSSP yields a complete scaling prescription for MoE architectures.
📝 Abstract
Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics, which are qualitatively distinct from the $\mu$P limit, through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning-rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.
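To make the three scaling regimes concrete, the sketch below shows which of $N$, $N_e$, $M$, and $K$ grow together in each one. It is only an illustration under stated assumptions, not the paper's parameterization: the names `MoEConfig`, `scale`, and `param_counts` are hypothetical, a single common integer factor stands in for the proportional ($\asymp$) scaling, and the parameter count assumes each expert is a two-matrix MLP with a linear router.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical container for the size hyperparameters named in the abstract."""
    N: int    # network width
    N_e: int  # expert width
    M: int    # number of experts
    K: int    # sparsity: experts active per token


def scale(cfg: MoEConfig, regime: str, factor: int) -> MoEConfig:
    """Scale cfg by a common integer factor under regime 'I', 'II', or 'III'.

    A single shared factor is an assumption; the regimes only require the
    co-scaled quantities to grow proportionally.
    """
    if regime == "I":    # co-scale N and N_e; M and K fixed
        return replace(cfg, N=cfg.N * factor, N_e=cfg.N_e * factor)
    if regime == "II":   # co-scale N, M, and K; N_e fixed
        return replace(cfg, N=cfg.N * factor, M=cfg.M * factor, K=cfg.K * factor)
    if regime == "III":  # scale all four proportionally
        return replace(cfg, N=cfg.N * factor, N_e=cfg.N_e * factor,
                       M=cfg.M * factor, K=cfg.K * factor)
    raise ValueError(f"unknown regime: {regime!r}")


def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts for one MoE layer.

    Assumption: each expert is an MLP with an N x N_e and an N_e x N matrix,
    and the router is a single N x M matrix. Biases are ignored.
    """
    per_expert = 2 * cfg.N * cfg.N_e
    router = cfg.N * cfg.M
    total = router + cfg.M * per_expert
    active = router + cfg.K * per_expert
    return total, active


base = MoEConfig(N=64, N_e=128, M=8, K=2)
for regime in ("I", "II", "III"):
    scaled = scale(base, regime, 2)
    total, active = param_counts(scaled)
    print(f"regime {regime}: {scaled} total={total} active={active}")
```

Under these assumptions, the regimes differ in how the active parameter count per token grows relative to the total, which is one reason a single hyperparameter prescription need not transfer across all three.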