Hyperparameter Transfer with Mixture-of-Expert Layers

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high hyperparameter tuning cost and poor cross-scale transferability of sparse Mixture-of-Experts (MoE) models, which stem from the additional hyperparameters and architectural dimensions these models introduce when scaled. The authors propose a novel Transformer parameterization justified by a Dynamical Mean-Field Theory (DMFT) analysis of MoE architectures. This parameterization enables hyperparameter transfer across model scales, from 51M to over 2B parameters, at a fixed token budget. By performing only short training runs on small models, the method yields hyperparameters that generalize to larger models, significantly reducing tuning overhead while preserving strong performance at scale.

📝 Abstract
Mixture-of-Experts (MoE) layers have emerged as an important tool for scaling up modern neural networks by decoupling total trainable parameters from the parameters activated in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; and (ii) new architectural scale dimensions (the number and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions at a fixed training token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified by sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behavior.
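To make the idea of a width-aware parameterization concrete, the sketch below shows a generic muP-style scaling of per-group initialization and Adam learning rates as width grows. This is an illustrative assumption, not the paper's DMFT-derived rules (which additionally cover depth, expert count, and expert size); in particular, treating the router like an ordinary hidden matrix is a simplification made here for clarity.

```python
# Hedged sketch: generic muP-style width scaling for the parameter groups of
# an MoE transformer. The multipliers follow the widely used muP recipe for
# embedding / hidden / readout matrices; treating router weights like a
# hidden-type matrix is an ASSUMPTION for illustration only -- the paper's
# DMFT analysis derives its own rules, which are not reproduced here.

def scaled_hparams(base_width: int, width: int,
                   base_lr: float = 1e-3, base_std: float = 0.02):
    """Return {group: (init_std, adam_lr)} after widening by width/base_width."""
    m = width / base_width  # width multiplier
    return {
        # embeddings: init std and lr kept width-independent
        "embedding": (base_std, base_lr),
        # hidden matrices: std ~ 1/sqrt(width), Adam lr ~ 1/width
        "hidden": (base_std / m ** 0.5, base_lr / m),
        # router (assumed hidden-like here): same scaling as hidden
        "router": (base_std / m ** 0.5, base_lr / m),
        # readout/output layer: std ~ 1/width, Adam lr ~ 1/width
        "output": (base_std / m, base_lr / m),
    }

# Tune at base_width, then reuse the same base HPs at larger width:
hp = scaled_hparams(base_width=256, width=1024)
```

The point of such a scheme is exactly the transfer property the abstract describes: HPs swept on a small proxy model remain near-optimal after rescaling, so only the cheap model needs tuning.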
Problem

Research questions and friction points this paper is trying to address.

Hyperparameter Transfer
Mixture-of-Experts
Neural Network Scaling
Router Weights
Model Architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperparameter Transfer
Mixture-of-Experts
Dynamical Mean-Field Theory
Scalable Transformers
Parameterization