🤖 AI Summary
Learning-rate scheduling in large language model training lacks rigorous theoretical foundations, leading to heuristic designs and suboptimal convergence. Method: This paper establishes, for the first time, a quantitative alignment between practical schedulers (e.g., linear decay) and tight non-smooth convex optimization lower bounds, eliminating the spurious logarithmic factors in prior analyses and enabling principled transfer of the optimal learning rate across schedulers. The approach integrates convex optimization theory, scheduler modeling, and empirical validation, with systematic evaluations on 124M- and 210M-parameter Llama models. Results: Theory-guided scheduler design yields faster convergence and improved stability, empirically validating the practical relevance of optimization theory for large-model training. Core contribution: bridging the gap between theoretical performance bounds and engineering schedulers via a transferable, interpretable, and theoretically grounded framework for learning-rate tuning.
📝 Abstract
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound by the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with the optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
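For concreteness, the constant schedule with linear cooldown studied in the abstract can be sketched as a simple step-to-learning-rate function. This is a minimal illustration, not the paper's code; the function name, signature, and the 20% cooldown fraction are assumptions chosen for the example.

```python
def lr_schedule(t, T, base_lr, cooldown_frac=0.2):
    """Constant learning rate with a linear cooldown to zero.

    t: current training step (1-indexed), T: total steps.
    cooldown_frac: fraction of steps spent in the cooldown phase
    (an illustrative choice, not a value from the paper).
    """
    t_cool = int((1 - cooldown_frac) * T)  # step where cooldown begins
    if t <= t_cool:
        return base_lr  # constant phase
    # linear decay from base_lr at t_cool down to 0 at t = T
    return base_lr * (T - t) / (T - t_cool)
```

With `T = 100` and `cooldown_frac=0.2`, the rate stays at `base_lr` for the first 80 steps, then decreases linearly, reaching half of `base_lr` at step 90 and zero at step 100. Extending a run for continued training, as in point (i), amounts to restarting the cooldown from a later `t_cool` with a larger `T`.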