The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

📅 2025-01-31
🤖 AI Summary
Learning-rate scheduling for large language model training has lacked rigorous theoretical foundations, leading to heuristic designs and suboptimal convergence. Method: the paper establishes a quantitative alignment between practical schedules (e.g., linear decay, constant with cooldown) and tight bounds from non-smooth convex optimization, eliminating the spurious logarithmic factors of prior analyses and enabling principled transfer of the optimal learning rate across schedules. It combines convex optimization theory, schedule modeling, and empirical validation on 124M- and 210M-parameter Llama-type models. Results: theory-guided schedule design yields faster convergence and improved stability, empirically validating the practical relevance of optimization theory for large-model training. Core contribution: a transferable, interpretable, and theoretically grounded framework for learning-rate tuning that bridges theoretical performance bounds and engineering practice.
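The two schedule families the summary refers to can be sketched as simple step-size shapes. This is a minimal illustration; the paper's exact warmup and horizon choices may differ, and the 20% cooldown fraction below is an assumption:

```python
import numpy as np

def linear_decay(T, peak_lr):
    """Linear-decay schedule: lr_t = peak_lr * (1 - t/T) for t = 0..T-1."""
    t = np.arange(T)
    return peak_lr * (1.0 - t / T)

def constant_with_cooldown(T, peak_lr, cooldown_frac=0.2):
    """Constant schedule followed by a linear cooldown to zero over the
    final `cooldown_frac` of training steps (the fraction is illustrative)."""
    lr = np.full(T, float(peak_lr))
    start = int((1.0 - cooldown_frac) * T)
    lr[start:] = peak_lr * np.linspace(1.0, 0.0, T - start, endpoint=False)
    return lr
```

Both functions return a length-`T` array of per-step learning rates, which is the object the paper's bound is evaluated on.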

📝 Abstract
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound through the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with the optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
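The idea of transferring the optimal learning rate across schedules can be illustrated with the classical averaged-iterate subgradient bound, a simpler stand-in for the refined last-iterate bound the paper actually uses (the values of D, G, and the schedule shapes below are assumptions): for a schedule shape s_t scaled by a base learning rate γ, the bound (D² + G²γ²Σs_t²)/(2γΣs_t) is minimized at γ* = D/(G·√(Σs_t²)), so optimal base rates for two shapes differ only by the ratio of their √(Σs_t²) terms.

```python
import numpy as np

def optimal_base_lr(shape, D=1.0, G=1.0):
    """Closed-form minimizer of the classical non-smooth convex bound
        (D^2 + G^2 * gamma^2 * sum(shape^2)) / (2 * gamma * sum(shape))
    over the base learning rate gamma: gamma* = D / (G * sqrt(sum(shape^2))).
    (Toy stand-in for the paper's refined last-iterate bound.)"""
    shape = np.asarray(shape, dtype=float)
    return D / (G * np.sqrt(np.sum(shape**2)))

T = 1000
t = np.arange(T)
linear = 1.0 - t / T                          # linear-decay shape
start = int(0.8 * T)                          # illustrative 20% cooldown
wsd = np.where(t < start, 1.0, (T - t) / (T - start))  # constant + cooldown shape

gamma_linear = optimal_base_lr(linear)
gamma_wsd = optimal_base_lr(wsd)
# Transfer rule under this bound: the optimal rate for one schedule
# recovers the other's via the ratio of sqrt(sum(shape^2)) terms.
ratio = np.sqrt(np.sum(wsd**2) / np.sum(linear**2))
print(gamma_linear, gamma_wsd, gamma_wsd * ratio)
```

In this toy model the transfer is exact by construction; the paper's contribution is showing empirically that an analogous transfer works for actual Llama-type training runs.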
Problem

Research questions and friction points this paper is trying to address.

Large Model Training
Learning Rate Adjustment
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning Rate Adjustment
Non-smooth Convex Optimization
Llama Model Training
👥 Authors

Fabian Schaipp
Inria Paris
Optimization · Machine Learning

Alexander Hägele
EPFL, Lausanne, Switzerland

Adrien Taylor
Inria - École Normale Supérieure
Optimization · Numerical Analysis · Computational Mathematics

Umut Simsekli
Inria - École Normale Supérieure
Deep Learning Theory · Langevin Monte Carlo

Francis R. Bach
Inria, Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France