Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of designing optimal learning-rate schedules under a fixed training budget so as to balance signal acquisition against noise forgetting. Within the framework of functional scaling laws, the authors analyze the source and capacity exponents and reveal a phase transition in the optimal schedule dictated by task difficulty: power-law decay is optimal for easy tasks, whereas difficult tasks require a warmup-stable-decay (WSD) structure. The study provides the first rigorous theoretical justification for both scheduling paradigms. Applied to one-pass SGD for kernel regression, the proposed power-decay schedule attains the exact minimax-optimal convergence rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate the theoretically predicted schedule shapes and demonstrate their superior empirical performance.
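A minimal sketch (in Python; the function name is illustrative, the regime condition is taken from the abstract) of how task difficulty selects the schedule family:

```python
def schedule_family(s: float, beta: float) -> str:
    """Schedule family suggested by the paper's phase transition.

    s    : source exponent (rate of signal learning), s > 0
    beta : capacity exponent (rate of noise forgetting), beta > 1
    """
    if s >= 1.0 - 1.0 / beta:
        return "power-decay"          # easy-task regime
    return "warmup-stable-decay"      # hard-task regime
```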

📝 Abstract
We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s<1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practice -- and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
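As an illustration, the sketch below implements the power-decay schedule $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ from the abstract, alongside a simple WSD shape (warmup omitted). The WSD tail shape and the fraction `decay_frac` are assumptions made for illustration; the abstract only states that the decay phase occupies a vanishing fraction of the horizon.

```python
import numpy as np

def power_decay_lr(z, N, eta_peak, beta):
    """Power-decay schedule: eta*(z) = eta_peak * (1 - z/N)**(2*beta - 1)."""
    z = np.asarray(z, dtype=float)
    return eta_peak * (1.0 - z / N) ** (2.0 * beta - 1.0)

def wsd_lr(z, N, eta_max, decay_frac=0.05, beta=2.0):
    """Illustrative warmup-stable-decay shape: hold eta_max for most of training,
    then decay to zero over the final decay_frac of the horizon (tail shape assumed)."""
    z = np.asarray(z, dtype=float)
    z0 = (1.0 - decay_frac) * N          # start of the decay phase
    lr = np.full_like(z, eta_max)
    tail = z >= z0
    lr[tail] = eta_max * ((N - z[tail]) / (N - z0)) ** (2.0 * beta - 1.0)
    return lr

# Example: evaluate both schedules over a horizon of N steps.
N = 10_000
steps = np.arange(N + 1)
lr_power = power_decay_lr(steps, N, eta_peak=0.1, beta=2.0)
lr_wsd = wsd_lr(steps, N, eta_max=0.1)
```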
Problem

Research questions and friction points this paper is trying to address.

learning-rate schedule
functional scaling laws
optimal scheduling
stochastic gradient descent
minimax optimality
Innovation

Methods, ideas, or system contributions that make the work stand out.

optimal learning-rate schedule
functional scaling laws
power decay
warmup-stable-decay
minimax optimality
Binghui Li
CMLR, Peking University
machine learning, deep learning theory
Zilin Wang
University of Oxford
Deep Reinforcement Learning, Autonomous Driving
Fengling Chen
MOE Key Laboratory of Bioinformatics, Department of Automation, Tsinghua University
statistical learning, bioinformatics
Shiyang Zhao
School of Mathematical Sciences, Peking University
Ruiheng Zheng
School of Mathematical Sciences, Peking University
Lei Wu
Center for Machine Learning Research, Peking University; School of Mathematical Sciences, Peking University; AI for Science Institute, Beijing