🤖 AI Summary
This work resolves the paradox between theoretical predictions (instability under large learning rates, feature freezing under small learning rates in the infinite-width limit) and the empirically observed slow decay of optimal learning rates under cross-entropy (CE) optimization. Through continuous-time dynamical analysis and extensive experiments across architectures (MLP, GPT), optimizers (SGD, Adam), and data modalities, we identify a "controlled divergence" regime unique to CE loss: logits diverge linearly in time, yet the loss, gradients, and activations remain bounded, enabling persistent feature evolution across all layers. Mean-squared error (MSE) loss lacks this property. Our analysis is the first to ground the efficacy of large learning rates in the intrinsic geometry of the CE loss function. We establish a rigorous theoretical connection between width-scaling laws and the optimal learning-rate decay exponent, and derive tight theoretical bounds for layer-adaptive learning-rate schedules.
📝 Abstract
The dominant paradigm for training large-scale vision and language models is He initialization with a single global learning rate (*standard parameterization*, SP). Despite its practical success, standard parameterization remains poorly understood from a theoretical perspective: Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. However, empirically optimal learning rates consistently decay much more slowly than theoretically predicted. By carefully studying neural network training dynamics, we demonstrate that this discrepancy is not fully explained by finite-width phenomena such as catapult effects or a lack of alignment between weights and incoming activations. We instead show that the apparent contradiction can be fundamentally resolved by taking the loss function into account: In contrast to Mean Squared Error (MSE) loss, we prove that under cross-entropy (CE) loss an intermediate *controlled divergence* regime emerges, where logits diverge but the loss, gradients, and activations remain stable. Stable training under large learning rates enables persistent feature evolution at scale in all hidden layers, which is crucial for the practical success of SP. In experiments across optimizers (SGD, Adam), architectures (MLPs, GPT), and data modalities (vision, language), we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scalings for standard initialization.
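As a minimal numerical sketch (a hypothetical illustration, not code from the paper) of why CE loss can tolerate diverging logits while MSE cannot: when the logits grow along a fixed direction that favors the correct class, the softmax saturates, so the CE gradient with respect to the logits, softmax(z) − y, stays bounded and in fact vanishes, whereas the MSE gradient 2(z − y) grows linearly with the logit scale. The class count, direction, and scales below are arbitrary choices for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

y = np.array([1.0, 0.0, 0.0])            # one-hot target (class 0 correct)
direction = np.array([1.0, -0.5, -0.5])  # logits diverge along this direction

for t in [1, 10, 100]:                   # logit scale grows linearly in "time"
    z = t * direction
    ce_grad = softmax(z) - y             # CE gradient w.r.t. logits: bounded, -> 0
    mse_grad = 2.0 * (z - y)             # MSE gradient w.r.t. logits: grows ~ t
    print(t, np.linalg.norm(ce_grad), np.linalg.norm(mse_grad))
```

Running this shows the CE gradient norm shrinking toward zero as the logits diverge, while the MSE gradient norm grows roughly linearly with the scale, consistent with the bounded-gradient behavior the abstract attributes to the controlled divergence regime.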