Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of designing optimal learning-rate schedules under a fixed training budget so as to balance signal acquisition against noise forgetting. Within the framework of functional scaling laws, the authors analyze the source and capacity exponents and reveal a phase transition in the optimal schedule dictated by task difficulty: power-law decay is optimal for easy tasks, whereas difficult tasks require a warmup-stable-decay (WSD) structure. The study provides the first rigorous theoretical justification for both scheduling paradigms. Applied to one-pass SGD for kernel regression, the proposed power-decay schedule attains the exact minimax-optimal convergence rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate the theoretically predicted schedule shapes and demonstrate their superior empirical performance.
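A minimal sketch (in Python; the function name is illustrative, the regime condition is taken from the abstract) of how task difficulty selects the schedule family:

```python
def schedule_family(s: float, beta: float) -> str:
    """Schedule family suggested by the paper's phase transition.

    s    : source exponent (rate of signal learning), s > 0
    beta : capacity exponent (rate of noise forgetting), beta > 1
    """
    if s >= 1.0 - 1.0 / beta:
        return "power-decay"          # easy-task regime
    return "warmup-stable-decay"      # hard-task regime
```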

📝 Abstract
We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s<1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practice -- and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
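As an illustration, the sketch below implements the power-decay schedule $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ from the abstract, alongside a simple WSD shape (warmup omitted). The WSD tail shape and the fraction `decay_frac` are assumptions made for illustration; the abstract only states that the decay phase occupies a vanishing fraction of the horizon.

```python
import numpy as np

def power_decay_lr(z, N, eta_peak, beta):
    """Power-decay schedule: eta*(z) = eta_peak * (1 - z/N)**(2*beta - 1)."""
    z = np.asarray(z, dtype=float)
    return eta_peak * (1.0 - z / N) ** (2.0 * beta - 1.0)

def wsd_lr(z, N, eta_max, decay_frac=0.05, beta=2.0):
    """Illustrative warmup-stable-decay shape: hold eta_max for most of training,
    then decay to zero over the final decay_frac of the horizon (tail shape assumed)."""
    z = np.asarray(z, dtype=float)
    z0 = (1.0 - decay_frac) * N          # start of the decay phase
    lr = np.full_like(z, eta_max)
    tail = z >= z0
    lr[tail] = eta_max * ((N - z[tail]) / (N - z0)) ** (2.0 * beta - 1.0)
    return lr

# Example: evaluate both schedules over a horizon of N steps.
N = 10_000
steps = np.arange(N + 1)
lr_power = power_decay_lr(steps, N, eta_peak=0.1, beta=2.0)
lr_wsd = wsd_lr(steps, N, eta_max=0.1)
```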
Problem

Research questions and friction points this paper is trying to address.

learning-rate schedule
functional scaling laws
optimal scheduling
stochastic gradient descent
minimax optimality
Innovation

Methods, ideas, or system contributions that make the work stand out.

optimal learning-rate schedule
functional scaling laws
power decay
warmup-stable-decay
minimax optimality
Binghui Li
CMLR, Peking University
machine learning, deep learning theory
Zilin Wang
University of Oxford
Deep Reinforcement Learning, Autonomous Driving
Fengling Chen
MOE Key Laboratory of Bioinformatics, Department of Automation, Tsinghua University
statistical learning, bioinformatics
Shiyang Zhao
School of Mathematical Sciences, Peking University
Ruiheng Zheng
School of Mathematical Sciences, Peking University
Lei Wu
Center for Machine Learning Research, Peking University; School of Mathematical Sciences, Peking University; AI for Science Institute, Beijing