🤖 AI Summary
This work provides a rigorous theoretical foundation for learning rate warmup, a widely adopted yet poorly understood deep learning practice. Addressing the question of why warmup accelerates convergence, the authors introduce a theoretical framework based on generalized $(L_0, L_1)$-smoothness, which characterizes, for the first time, how the local curvature of the loss evolves during early training. Leveraging this framework, they establish formal convergence guarantees for warmup strategies and derive tight upper and lower complexity bounds, revealing that the acceleration stems from adaptively avoiding the highly curved regions encountered at initialization. The analysis covers multilayer neural networks under both mean squared error and cross-entropy losses. Extensive experiments on language and vision models confirm that warmup significantly improves convergence speed and robustness compared to a fixed learning rate.
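The summary describes the generalized smoothness condition only in words: local curvature is bounded by a linear function of the loss sub-optimality. One plausible formalization of such a condition (the exact constants and norm are assumptions here, not taken from the paper) is:

```latex
% Generalized (L_0, L_1)-smoothness: local curvature grows at most
% linearly with the loss sub-optimality f(x) - f^\star.
% f^\star denotes the infimum of the loss; L_0, L_1 >= 0 are constants.
\[
  \bigl\| \nabla^2 f(x) \bigr\| \;\le\; L_0 + L_1 \bigl( f(x) - f^\star \bigr),
  \qquad \text{where } f^\star = \inf_x f(x).
\]
```

Under a bound of this shape, regions where the loss is still large can be sharply curved, which is consistent with the claim that small initial step sizes help avoid highly curved regions early in training.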
📝 Abstract
Learning rate warm-up (increasing the learning rate at the beginning of training) has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
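To make the setting concrete, here is a minimal sketch of Gradient Descent with a linear warm-up schedule on a toy quadratic. This is an illustration of the general heuristic, not the paper's exact algorithm or step-size schedule; the function names (`warmup_lr`, `gd_with_warmup`) and all parameter values are illustrative choices.

```python
import numpy as np

def warmup_lr(step, warmup_steps, base_lr):
    """Linear warm-up: ramp the step size from ~0 up to base_lr,
    then hold it constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def gd_with_warmup(grad_fn, x0, base_lr=0.1, warmup_steps=50, n_steps=200):
    """Gradient Descent x_{t+1} = x_t - eta_t * grad f(x_t)
    with a linearly warmed-up step size eta_t."""
    x = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        lr = warmup_lr(t, warmup_steps, base_lr)
        x = x - lr * grad_fn(x)
    return x

# Toy example: f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x_final = gd_with_warmup(lambda x: x, x0=[5.0, -3.0])
```

The small early step sizes mimic the behavior the analysis attributes to warm-up: the iterates take cautious steps while curvature (in the paper's setting, tied to the loss sub-optimality) is still large, then move faster once the base step size is reached.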