Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In modern scale-invariant architectures, normalization layers make backpropagation scale sensitive, undermining the cross-width transferability of μP learning rates. Method: We observe that the singular-value spectrum of each matrix parameter keeps an approximately invariant shape, scaling in norm only by √(η/λ), and propose the weight-decay scaling rule λ₂ ∝ √d. Combined with the μP learning-rate rule, this enables zero-shot, joint width transfer of both learning rate and weight decay under AdamW. Contribution/Results: We introduce two diagnostics, layer-wise gain invariance and top-singular-value matching, to check sublayer-scale consistency. Experiments on LLaMA-style Transformers and synthetic tasks demonstrate hyperparameter-free transfer from proxy to target widths, significantly reducing the cost of hyperparameter tuning for large language models.

📝 Abstract
Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($μ$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $μ$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{η/λ}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{η/λ}\cdot d^{0.75}$. Combining this observation with the $μ$P learning-rate rule $η_2 \propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $λ_2 \propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $η_1 = Θ_d(1)$ and $λ_1 = 0$, this yields *zero-shot* transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $μ$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
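The transfer recipe in the abstract reduces to two power laws in the width ratio: $η_2 \propto d^{-1}$ ($μ$P) and $λ_2 \propto \sqrt{d}$ (the proposed rule). A minimal sketch of the resulting zero-shot transfer for matrix-like parameters follows; the function name and example values are illustrative, not from the paper's code:

```python
import math

def transfer_hyperparams(eta_proxy, lam_proxy, d_proxy, d_target):
    """Zero-shot AdamW hyperparameter transfer for matrix-like parameters:
        learning rate  eta_2 ∝ 1/d       (muP rule)
        weight decay   lam_2 ∝ sqrt(d)   (proposed rule)
    Vector-like parameters keep eta_1 = Θ_d(1) and lam_1 = 0.
    """
    ratio = d_target / d_proxy
    eta_target = eta_proxy / ratio             # eta_2 ∝ d^{-1}
    lam_target = lam_proxy * math.sqrt(ratio)  # lam_2 ∝ d^{1/2}
    return eta_target, lam_target

# Example: hyperparameters tuned at proxy width 256,
# transferred to a 16x wider target model.
eta, lam = transfer_hyperparams(eta_proxy=1e-2, lam_proxy=0.1,
                                d_proxy=256, d_target=4096)
# eta = 1e-2 / 16 = 6.25e-4, lam = 0.1 * 4 = 0.4
```

Note that the two rules pull in opposite directions: the learning rate shrinks with width while the weight decay grows, which is what keeps the steady-state scale $\sqrt{η/λ}$ shrinking fast enough to cancel the $d^{0.75}$ growth of the top singular value.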
Problem

Research questions and friction points this paper is trying to address.

Addresses learning rate transfer failure in scale-invariant neural network architectures
Develops weight-decay scaling rules for AdamW to maintain sublayer gain across widths
Enables zero-shot hyperparameter transfer without per-width tuning sweeps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight-decay scaling rule for AdamW optimizer
Preserves sublayer gain across network widths
Enables zero-shot hyperparameter transfer without tuning
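The top-singular-value diagnostic mentioned in the abstract can be sketched as follows. The helper names and the fitted constant `c` are illustrative assumptions; only the stated scaling $σ_{\max} \approx \sqrt{η/λ}\cdot d^{0.75}$ comes from the paper:

```python
import numpy as np

def top_singular_value(W):
    # Spectral norm: the largest singular value of a weight matrix.
    return np.linalg.svd(W, compute_uv=False)[0]

def predicted_top_sv(eta, lam, d, c=1.0):
    # Empirical steady-state scale reported in the abstract:
    #   sigma_max ≈ c * sqrt(eta / lam) * d**0.75,
    # with c a width-independent constant fitted on the proxy model.
    return c * np.sqrt(eta / lam) * d ** 0.75

# Sanity check of the cancellation: with eta ∝ 1/d and lam ∝ sqrt(d),
# sqrt(eta/lam) ∝ d^{-0.75}, so the predicted top singular value (and
# hence the sublayer gain) is approximately width invariant.
proxy = predicted_top_sv(eta=1e-2, lam=0.1, d=256)
target = predicted_top_sv(eta=6.25e-4, lam=0.4, d=4096)
```

In practice the check would compare `top_singular_value` of corresponding weight matrices in the proxy and target models after training has reached the optimizer-governed steady state; matching values indicate the weight-decay rule preserved sublayer gain.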