🤖 AI Summary
This work provides a rigorous theoretical foundation for learning rate warmup, a widely adopted yet poorly understood deep learning practice. Addressing the question of why warmup accelerates convergence, the authors introduce a theoretical framework based on generalized $(L_0, L_1)$-smoothness, which characterizes, for the first time, how the local curvature of the loss evolves during early training. Leveraging this framework, they establish formal convergence guarantees for warmup strategies and derive tight upper and lower complexity bounds, revealing that the acceleration stems from adaptively avoiding the highly curved regions encountered at initialization. The analysis covers multilayer neural networks under both mean squared error and cross-entropy losses. Extensive experiments on language and vision models confirm that warmup significantly improves convergence speed and robustness compared to a fixed learning rate.
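The summary describes the generalized smoothness condition only in words: local curvature is bounded by a linear function of the loss sub-optimality. One plausible formalization of such a condition (the exact constants and norm are assumptions here, not taken from the paper) is:

```latex
% Generalized (L_0, L_1)-smoothness: local curvature grows at most
% linearly with the loss sub-optimality f(x) - f^\star.
% f^\star denotes the infimum of the loss; L_0, L_1 >= 0 are constants.
\[
  \bigl\| \nabla^2 f(x) \bigr\| \;\le\; L_0 + L_1 \bigl( f(x) - f^\star \bigr),
  \qquad \text{where } f^\star = \inf_x f(x).
\]
```

Under a bound of this shape, regions where the loss is still large can be sharply curved, which is consistent with the claim that small initial step sizes help avoid highly curved regions early in training.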
📝 Abstract
Learning rate warm-up (increasing the learning rate at the beginning of training) has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
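To make the setting concrete, here is a minimal sketch of Gradient Descent with a linear warm-up schedule on a toy quadratic. This is an illustration of the general heuristic, not the paper's exact algorithm or step-size schedule; the function names (`warmup_lr`, `gd_with_warmup`) and all parameter values are illustrative choices.

```python
import numpy as np

def warmup_lr(step, warmup_steps, base_lr):
    """Linear warm-up: ramp the step size from ~0 up to base_lr,
    then hold it constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def gd_with_warmup(grad_fn, x0, base_lr=0.1, warmup_steps=50, n_steps=200):
    """Gradient Descent x_{t+1} = x_t - eta_t * grad f(x_t)
    with a linearly warmed-up step size eta_t."""
    x = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        lr = warmup_lr(t, warmup_steps, base_lr)
        x = x - lr * grad_fn(x)
    return x

# Toy example: f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x_final = gd_with_warmup(lambda x: x, x0=[5.0, -3.0])
```

The small early step sizes mimic the behavior the analysis attributes to warm-up: the iterates take cautious steps while curvature (in the paper's setting, tied to the loss sub-optimality) is still large, then move faster once the base step size is reached.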