🤖 AI Summary
In deep learning, model initialization and learning rates are often set by empirical rules or heuristics because systematic hyperparameter search is computationally prohibitive. Method: This work conducts a large-scale empirical validation of μ-Parameterization (μP) for learning-rate transfer across model sizes in Transformer architectures, at scales up to 10B parameters and 190B tokens, to enable zero-shot selection of near-optimal learning rates for large models from sweeps on small ones. Using large-scale distributed training and ablation studies, it evaluates μP's efficacy across diverse configurations. Contribution/Results: μP achieves near-optimal convergence in most settings: <5% accuracy degradation on a 1.2B model and 2–3× faster training on a 10B model compared to baselines. The study supports μP's practical applicability while identifying its failure modes and proposing targeted refinements. These advances substantially reduce the hyperparameter-optimization cost of large language models, bridging theory and practice in scalable neural network training.
📝 Abstract
Deep learning models have become a cornerstone of modern AI research, yet their initializations and learning rates may at times be set in an opaque or ad-hoc fashion due to the high cost of hyperparameter sweeps. The $\mu$-Parameterization ($\mu$P) offers a possible solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to its higher implementation complexity, many variants, or complex theoretical background. This work considers $\mu$P empirically, focusing on the popular transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield near-optimal learning rates in practice? Studying over a dozen ablations with up to 1.2B parameters and 33B tokens and a large-scale experiment with up to 10B parameters and 190B tokens, we observe a positive answer for most settings, and discuss improvements otherwise.
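To make the "scaling rules" concrete, here is a minimal sketch of the standard μP prescriptions for Adam as given in the μ-Transfer literature: hidden-weight initialization variance scales like 1/fan_in, and the hidden-matrix learning rate and output-logit multiplier both scale like 1/m, where m is the width multiplier relative to a small proxy model. The function name and return format are illustrative, not from the paper.

```python
import math

def mup_scaling(base_width: int, width: int, base_lr: float = 1e-3) -> dict:
    """Illustrative muP-style scaling rules for Adam (a sketch, not the
    paper's exact recipe).

    A learning rate tuned at `base_width` is reused unchanged as `base_lr`;
    the rules below rescale per-tensor quantities so that the small-model
    optimum transfers to the larger width ("mu-Transfer").
    """
    m = width / base_width                 # width multiplier
    init_std = 1.0 / math.sqrt(width)      # hidden weights: Var ~ 1 / fan_in
    hidden_lr = base_lr / m                # Adam LR for hidden matrices: ~ 1/m
    output_mult = 1.0 / m                  # output-logit multiplier: ~ 1/m
    return {"init_std": init_std, "hidden_lr": hidden_lr, "output_mult": output_mult}

# Tune once at a small proxy width, then apply the same base_lr at scale:
small = mup_scaling(base_width=256, width=256)
large = mup_scaling(base_width=256, width=4096)
```

In this sketch the large model's hidden learning rate is simply the tuned small-model rate divided by the width multiplier (here 16), which is the "zero-shot transfer" the abstract refers to: no new sweep is run at the large width.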