🤖 AI Summary
This work investigates the asymptotic behavior of the optimal learning rate with respect to network width for linear multilayer perceptrons under different parameterizations, aiming to theoretically explain the empirical phenomenon of "learning rate transferability" across widths. We conduct rigorous theoretical analysis comparing standard parameterization (SP), neural tangent parameterization (NTP), and μ-parameterization (μP). Our analysis proves that, as width tends to infinity, the optimal learning rate converges to a nonzero constant under μP—enabling width-agnostic learning rate selection—whereas it vanishes under both SP and NTP, precluding such transferability. To our knowledge, this is the first formal proof establishing μP's unique asymptotic stability in optimal learning rate scaling. Extensive experiments corroborate the theoretical predictions across widths. These results provide a principled theoretical foundation for learning rate tuning in deep learning and highlight μP's intrinsic robustness under width scaling.
📝 Abstract
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parametrization designed to "maximize" feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a *non-zero constant* as width goes to infinity, providing a theoretical explanation of learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
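The intuition behind learning rate transfer can be seen in a one-hidden-layer linear network: under SP with a single global learning rate, one SGD step changes the output by an amount that grows linearly with width (so the stable learning rate must shrink like $1/n$), whereas under $\mu$P, with its width-scaled per-layer learning rates and $1/n$-scale output initialization, the one-step output change stays $\Theta(1)$ in width. The sketch below illustrates this numerically; it is a minimal toy setup (random input, scalar target, MSE loss), not the paper's exact construction, and the specific per-layer $\mu$P multipliers (`eta * n` for the input layer, `eta / n` for the output layer) follow the standard SGD scaling conventions for $\mu$P rather than anything stated in the abstract.

```python
import numpy as np

def one_step_logit_change(n, param, eta=0.1, d=16, seed=0):
    """Magnitude of the output change |f' - f| after one SGD step on a
    one-hidden-layer linear net of width n, under SP (global learning
    rate) or muP (width-scaled per-layer learning rates).
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d) / np.sqrt(d)   # fixed unit-scale input
    y = 1.0                                    # scalar regression target
    # Input layer: same 1/fan_in init under both parametrizations.
    W1 = rng.standard_normal((n, d)) / np.sqrt(d)
    if param == "sp":
        W2 = rng.standard_normal(n) / np.sqrt(n)  # init std 1/sqrt(n)
        eta1 = eta2 = eta                          # one global LR
    elif param == "mup":
        W2 = rng.standard_normal(n) / n            # init std 1/n
        eta1, eta2 = eta * n, eta / n              # muP per-layer LRs (SGD)
    else:
        raise ValueError(param)
    h = W1 @ x          # hidden features
    f = W2 @ h          # network output
    err = f - y
    # SGD step on L = 0.5 * (f - y)^2.
    W1_new = W1 - eta1 * err * np.outer(W2, x)  # dL/dW1 = err * W2^T x^T
    W2_new = W2 - eta2 * err * h                # dL/dW2 = err * h
    f_new = W2_new @ (W1_new @ x)
    return abs(f_new - f)
```

Averaging over seeds, the SP change grows roughly linearly as `n` goes from 64 to 1024, while the $\mu$P change stays at the same order, which is the mechanism that forces the SP optimal learning rate toward zero with width.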