🤖 AI Summary
This work investigates the asymptotic behavior of the optimal learning rate with respect to network width for linear multilayer perceptrons under different parameterizations, aiming to theoretically explain the empirical phenomenon of "learning rate transferability" across widths. We conduct rigorous theoretical analysis comparing standard parameterization (SP), neural tangent parameterization (NTP), and μ-parameterization (μP). Our analysis proves that, as width tends to infinity, the optimal learning rate converges to a nonzero constant under μP—enabling width-agnostic learning rate selection—whereas it vanishes under both SP and NTP, precluding such transferability. To our knowledge, this is the first formal proof establishing μP's unique asymptotic stability in optimal learning rate scaling. Extensive experiments corroborate the theoretical predictions across widths. These results provide a principled theoretical foundation for learning rate tuning in deep learning and highlight μP's intrinsic robustness under width scaling.
📝 Abstract
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parametrization designed to "maximize" feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a *non-zero constant* as width goes to infinity, providing a theoretical explanation of learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.
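The intuition behind learning rate transfer can be seen in a one-hidden-layer linear network: under SP with a single global learning rate, one SGD step changes the output by an amount that grows linearly with width (so the stable learning rate must shrink like $1/n$), whereas under $\mu$P, with its width-scaled per-layer learning rates and $1/n$-scale output initialization, the one-step output change stays $\Theta(1)$ in width. The sketch below illustrates this numerically; it is a minimal toy setup (random input, scalar target, MSE loss), not the paper's exact construction, and the specific per-layer $\mu$P multipliers (`eta * n` for the input layer, `eta / n` for the output layer) follow the standard SGD scaling conventions for $\mu$P rather than anything stated in the abstract.

```python
import numpy as np

def one_step_logit_change(n, param, eta=0.1, d=16, seed=0):
    """Magnitude of the output change |f' - f| after one SGD step on a
    one-hidden-layer linear net of width n, under SP (global learning
    rate) or muP (width-scaled per-layer learning rates).
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d) / np.sqrt(d)   # fixed unit-scale input
    y = 1.0                                    # scalar regression target
    # Input layer: same 1/fan_in init under both parametrizations.
    W1 = rng.standard_normal((n, d)) / np.sqrt(d)
    if param == "sp":
        W2 = rng.standard_normal(n) / np.sqrt(n)  # init std 1/sqrt(n)
        eta1 = eta2 = eta                          # one global LR
    elif param == "mup":
        W2 = rng.standard_normal(n) / n            # init std 1/n
        eta1, eta2 = eta * n, eta / n              # muP per-layer LRs (SGD)
    else:
        raise ValueError(param)
    h = W1 @ x          # hidden features
    f = W2 @ h          # network output
    err = f - y
    # SGD step on L = 0.5 * (f - y)^2.
    W1_new = W1 - eta1 * err * np.outer(W2, x)  # dL/dW1 = err * W2^T x^T
    W2_new = W2 - eta2 * err * h                # dL/dW2 = err * h
    f_new = W2_new @ (W1_new @ x)
    return abs(f_new - f)
```

Averaging over seeds, the SP change grows roughly linearly as `n` goes from 64 to 1024, while the $\mu$P change stays at the same order, which is the mechanism that forces the SP optimal learning rate toward zero with width.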