🤖 AI Summary
In deep learning, model initialization and learning rates are often set by empirical rules or heuristics because systematic hyperparameter search is computationally prohibitive. Method: This work conducts a large-scale empirical validation of μ-Parameterization (μP) for learning-rate transfer across model sizes in Transformer architectures, at scales up to 10B parameters and 190B tokens, to enable zero-shot selection of near-optimal learning rates for large models from sweeps on small ones. Using large-scale distributed training and ablation studies, it evaluates μP's efficacy across diverse configurations. Contribution/Results: μP achieves near-optimal convergence in most settings: <5% accuracy degradation on a 1.2B model and 2–3× faster training on a 10B model compared to baselines. The study supports μP's practical applicability while identifying its failure modes and proposing targeted refinements. These advances substantially reduce the hyperparameter-optimization cost of large language models, bridging theory and practice in scalable neural network training.
📝 Abstract
Deep learning models have become a cornerstone of modern AI research, yet their initializations and learning rates may at times be set in an opaque or ad-hoc fashion due to the high cost of hyperparameter sweeps. The $\mu$-Parameterization ($\mu$P) offers a possible solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to its higher implementation complexity, many variants, or complex theoretical background. This work considers $\mu$P empirically, focusing on the popular transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield near-optimal learning rates in practice? Studying over a dozen ablations with up to 1.2B parameters and 33B tokens and a large-scale experiment with up to 10B parameters and 190B tokens, we observe a positive answer for most settings, and discuss improvements otherwise.
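To make the "scaling rules" concrete, here is a minimal sketch of the standard μP prescriptions for Adam as given in the μ-Transfer literature: hidden-weight initialization variance scales like 1/fan_in, and the hidden-matrix learning rate and output-logit multiplier both scale like 1/m, where m is the width multiplier relative to a small proxy model. The function name and return format are illustrative, not from the paper.

```python
import math

def mup_scaling(base_width: int, width: int, base_lr: float = 1e-3) -> dict:
    """Illustrative muP-style scaling rules for Adam (a sketch, not the
    paper's exact recipe).

    A learning rate tuned at `base_width` is reused unchanged as `base_lr`;
    the rules below rescale per-tensor quantities so that the small-model
    optimum transfers to the larger width ("mu-Transfer").
    """
    m = width / base_width                 # width multiplier
    init_std = 1.0 / math.sqrt(width)      # hidden weights: Var ~ 1 / fan_in
    hidden_lr = base_lr / m                # Adam LR for hidden matrices: ~ 1/m
    output_mult = 1.0 / m                  # output-logit multiplier: ~ 1/m
    return {"init_std": init_std, "hidden_lr": hidden_lr, "output_mult": output_mult}

# Tune once at a small proxy width, then apply the same base_lr at scale:
small = mup_scaling(base_width=256, width=256)
large = mup_scaling(base_width=256, width=4096)
```

In this sketch the large model's hidden learning rate is simply the tuned small-model rate divided by the width multiplier (here 16), which is the "zero-shot transfer" the abstract refers to: no new sweep is run at the large width.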