Weight Decay may matter more than muP for Learning Rate Transfer in Practice

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of learning rate tuning across model widths in large-scale neural network training. Through extensive empirical analysis, the authors identify weight decay, not muP scaling, as the key mechanism for stabilizing update dynamics and enabling effective learning rate transfer; muP, in practice, primarily provides implicit learning rate warmup. Accordingly, they abandon strict muP parameterization and propose a simpler alternative: explicit learning rate warmup scheduling coupled with weight decay tuning. This approach achieves learning rate transfer on par with, or surpassing, muP across diverse settings, including large language models (LLMs). The results challenge the prevailing assumption that muP is theoretically necessary for width-invariant training, offering a simpler, more robust, and deployment-friendly paradigm for hyperparameter transfer in large models.

📝 Abstract
Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests muP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical practice such as why muP requires the independent weight decay variant for successful transfer.
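As a rough illustration of the scaling rule under discussion (a sketch, not code from the paper), muP scales hidden-layer learning rates inversely with width relative to a tuned base model. A minimal example, assuming Adam-style updates and reducing the rule to a bare width ratio (the function name and simplification are mine):

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """muP-style learning-rate scaling for hidden layers under Adam:
    a rate tuned at base_width is transferred to a wider model by
    shrinking it proportionally to the width increase (lr ~ 1/width).
    Illustrative simplification; real muP also prescribes init and
    output-layer scalings."""
    return base_lr * base_width / width

# A rate tuned at width 256 transfers to width 1024 as a 4x smaller rate.
print(mup_hidden_lr(1e-3, 256, 1024))  # 0.00025
```

The paper's claim is that this 1/width shrinkage mainly matters early in training, acting like an implicit warmup, while weight decay does the width-invariant stabilization for the rest of training.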
Problem

Research questions and friction points this paper is trying to address.

Evaluating learning rate transfer methods across neural network widths
Challenging muP assumptions about stable update dynamics in practice
Identifying weight decay as key factor for learning rate transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight decay stabilizes internal representation updates across widths
Modified warmup schedules can largely replace muP scaling
muP primarily acts as implicit learning rate warmup
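The abstract notes that muP empirically requires the independent weight decay variant for successful transfer. A minimal sketch of the distinction (my own illustrative code, not the paper's): standard AdamW couples the decay strength to the learning rate (w -= lr * wd * w), whereas the independent variant applies wd directly (w -= wd * w), so rescaling lr across widths does not change the per-step shrinkage.

```python
def decay_step(w: float, update: float, lr: float, wd: float,
               independent: bool) -> float:
    """One parameter step contrasting two decoupled weight-decay variants.
    Coupled (AdamW default): shrink by lr * wd per step.
    Independent: shrink by wd per step, regardless of lr.
    Scalar toy example for illustration only."""
    w = w - lr * update                    # optimizer update direction
    shrink = wd if independent else lr * wd
    return w * (1.0 - shrink)

# Zero update isolates the decay term.
print(decay_step(1.0, 0.0, lr=0.5, wd=0.1, independent=False))  # 0.95
print(decay_step(1.0, 0.0, lr=0.5, wd=0.1, independent=True))   # 0.9
```

Under the independent variant, halving the learning rate leaves the decay-driven equilibrium of the weight norm unchanged, which is consistent with the paper's view of weight decay as the width-invariant stabilizer.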
Authors
Atli Kosson, PhD Student, EPFL (Machine Learning, Computer Vision, Neural Networks)
Jeremy Welborn, Amazon FAR (Frontier AI & Robotics)
Yang Liu, Amazon FAR (Frontier AI & Robotics)
Martin Jaggi, EPFL (Machine Learning, Optimization)
Xi Chen, Amazon FAR (Frontier AI & Robotics)