🤖 AI Summary
This study examines how optimal optimization hyperparameters, in particular the learning rate, can be transferred from small-scale to large-scale language models. The authors develop a framework that quantifies hyperparameter transfer with three metrics: the quality of the scaling law fit, the robustness to extrapolation errors, and the asymptotic loss penalty due to the choice of parameterization. Through systematic ablations, they analyze how different parameterizations affect the quality of learning rate transfer. They find that, when training with AdamW, the advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) stems overwhelmingly from a higher learning rate in the embedding layer, which existing μP theory does not adequately explain. Under SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match μP substantially smooths training and improves hyperparameter transfer. The study also finds that weight decay improves scaling law fits but, in the fixed token-per-parameter setting, hurts the robustness of extrapolation.
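The embedding learning rate adjustment described in the summary can be illustrated with a minimal sketch. It is not the authors' code: the function name `sp_learning_rates`, the two group names, and the reading of "width factor" as `width / base_width` (the ratio to a small reference model) are assumptions made for illustration only.

```python
def sp_learning_rates(base_lr, width, base_width, scale_embedding=True):
    """Return illustrative per-group learning rates for an SP model.

    Assumption: the "width factor" is width / base_width, i.e. the ratio
    of the model width to the width of a small reference model.
    """
    if base_lr <= 0 or width <= 0 or base_width <= 0:
        raise ValueError("base_lr, width and base_width must be positive")

    width_factor = width / base_width

    # Under plain SP every parameter group shares the same learning rate.
    # The adjustment discussed above raises only the embedding rate.
    embedding_lr = base_lr * width_factor if scale_embedding else base_lr

    return {"embedding": embedding_lr, "other": base_lr}


# Example: a model 4x wider than the reference gets a 4x embedding rate.
rates = sp_learning_rates(base_lr=1e-3, width=1024, base_width=256)
print(rates)
```

In a framework such as PyTorch, rates like these would typically be passed to the optimizer as separate parameter groups, one for the embedding weights and one for everything else.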
📝 Abstract
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
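As a rough illustration of fitting a scaling law to a hyperparameter, the sketch below fits a power law, optimal learning rate = a · N^b, by least squares in log-log space and extrapolates it to a larger scale. The power-law form, the use of R² as a stand-in for "quality of the scaling law fit", the function names, and the data points are all assumptions for illustration; the abstract does not specify the functional form or the exact definitions of the three metrics.

```python
import math


def fit_power_law(sizes, optimal_lrs):
    """Fit optimal_lr = a * size**b by least squares in log-log space.

    Returns (a, b, r_squared). R^2 is used here as one possible proxy
    for fit quality; it is not necessarily the metric used in the paper.
    """
    if len(sizes) != len(optimal_lrs) or len(set(sizes)) < 2:
        raise ValueError("need matching lists with at least two distinct sizes")

    xs = [math.log(s) for s in sizes]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Ordinary least squares slope and intercept on the log-transformed data.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x

    # Coefficient of determination in log space.
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0

    return math.exp(log_a), b, r_squared


def predict_lr(a, b, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** b


# Synthetic, made-up data: an exact power law with exponent -0.5.
sizes = [256, 512, 1024, 2048]
lrs = [0.1 * s ** -0.5 for s in sizes]

a, b, r2 = fit_power_law(sizes, lrs)
print(f"a={a:.4f}, b={b:.4f}, R^2={r2:.4f}")
print(f"extrapolated lr at size 8192: {predict_lr(a, b, 8192):.6f}")
```

With real sweep results, the measured optimal learning rates would not lie exactly on the fitted line, and the gap between the extrapolated and the true optimum at the target scale is what a robustness metric would need to capture.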