🤖 AI Summary
Learning-rate scheduling in large language model training lacks rigorous theoretical foundations, leading to heuristic designs and suboptimal convergence. Method: This paper establishes, for the first time, a quantitative alignment between practical schedulers (e.g., linear decay) and tight non-smooth convex optimization lower bounds, eliminating the spurious logarithmic factors in prior analyses and enabling principled transfer of the optimal learning rate across schedulers. The approach integrates convex optimization theory, scheduler modeling, and empirical validation, with systematic evaluations on 124M- and 210M-parameter Llama models. Results: Theory-guided scheduler design yields faster convergence and improved stability, empirically validating the practical relevance of optimization theory for large-model training. Core contribution: bridging the gap between theoretical performance bounds and engineering schedulers via a transferable, interpretable, and theoretically grounded framework for learning-rate tuning.
📝 Abstract
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound by the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with the optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
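For concreteness, the constant schedule with linear cooldown studied in the abstract can be sketched as a simple step-to-learning-rate function. This is a minimal illustration, not the paper's code; the function name, signature, and the 20% cooldown fraction are assumptions chosen for the example.

```python
def lr_schedule(t, T, base_lr, cooldown_frac=0.2):
    """Constant learning rate with a linear cooldown to zero.

    t: current training step (1-indexed), T: total steps.
    cooldown_frac: fraction of steps spent in the cooldown phase
    (an illustrative choice, not a value from the paper).
    """
    t_cool = int((1 - cooldown_frac) * T)  # step where cooldown begins
    if t <= t_cool:
        return base_lr  # constant phase
    # linear decay from base_lr at t_cool down to 0 at t = T
    return base_lr * (T - t) / (T - t_cool)
```

With `T = 100` and `cooldown_frac=0.2`, the rate stays at `base_lr` for the first 80 steps, then decreases linearly, reaching half of `base_lr` at step 90 and zero at step 100. Extending a run for continued training, as in point (i), amounts to restarting the cooldown from a later `t_cool` with a larger `T`.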