Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing scaling laws focus solely on final loss, neglecting training dynamics and the impact of learning rate schedules (LRS). Method: We propose Functional Scaling Laws (FSL), the first framework to model training dynamics via an intrinsic-time viewpoint and stochastic differential equation (SDE) modeling of SGD, characterizing the effect of the LRS analytically through a convolution-type functional term. Leveraging a teacher-student kernel regression setup trained with online SGD, FSL captures the evolution of population risk throughout training. Contribution/Results: FSL provides a unified theoretical explanation for empirical phenomena, including why higher-capacity models are more data- and compute-efficient and why learning rate decay improves training efficiency. Experiments on models ranging from 0.1B to 1B parameters demonstrate high-fidelity fitting and accurate prediction of training loss curves. Crucially, FSL establishes an interpretable, principled foundation for LRS design, bridging theory and practice in deep learning optimization.
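
Below is a minimal sketch of the kind of teacher-student online SGD experiment the summary describes. The linear (identity-kernel) model, dimension, noise level, and learning rate are illustrative assumptions rather than the paper's exact kernel-regression setup; the sketch only shows how population risk can be tracked step by step as fresh data arrive online.

```python
import numpy as np

# Minimal teacher-student online SGD sketch (assumed linear/identity-kernel setup,
# not the paper's exact kernel-regression model): a student w tracks a fixed
# teacher w_star from streaming Gaussian data, recording population risk each step.

rng = np.random.default_rng(0)
d = 100                                     # ambient dimension (illustrative)
sigma = 0.1                                 # label-noise level (illustrative)
w_star = rng.normal(size=d) / np.sqrt(d)    # teacher parameters
w = np.zeros(d)                             # student, initialised at zero
eta = 1e-2                                  # constant learning rate for this sketch

risks = []
for t in range(5_000):
    x = rng.normal(size=d)                  # fresh sample each step -> online SGD
    y = x @ w_star + sigma * rng.normal()
    grad = (x @ w - y) * x                  # stochastic gradient of the squared loss
    w -= eta * grad
    # For isotropic Gaussian inputs the population risk is ||w - w_star||^2 / 2
    # plus the irreducible noise floor sigma^2 / 2.
    risks.append(0.5 * float(np.dot(w - w_star, w - w_star)) + 0.5 * sigma**2)
```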

📝 Abstract
Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). However, most existing works on scaling laws primarily focus on the final-step loss, overlooking the loss dynamics during the training process and, crucially, the impact of learning rate schedule (LRS). In this paper, we aim to bridge this gap by studying a teacher-student kernel regression setup trained via online stochastic gradient descent (SGD). Leveraging a novel intrinsic time viewpoint and stochastic differential equation (SDE) modeling of SGD, we introduce the Functional Scaling Law (FSL), which characterizes the evolution of population risk during the training process for general LRSs. Remarkably, the impact of the LRSs is captured through an explicit convolution-type functional term, making their effects fully tractable. To illustrate the utility of FSL, we analyze three widely used LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- under both data-limited and compute-limited regimes. We provide theoretical justification for widely adopted empirical practices in LLMs pre-training such as (i) higher-capacity models are more data- and compute-efficient; (ii) learning rate decay can improve training efficiency; (iii) WSD-like schedules can outperform direct-decay schedules. Lastly, we explore the practical relevance of FSL as a surrogate model for fitting, predicting and optimizing the loss curves in LLM pre-training, with experiments conducted across model sizes ranging from 0.1B to 1B parameters. We hope our FSL framework can deepen the understanding of LLM pre-training dynamics and provide insights for improving large-scale model training.
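
To make the abstract's description concrete, here is a schematic surrogate in the spirit of FSL: intrinsic time is taken as the cumulative learning rate, the deterministic part of the loss decays as a power of intrinsic time, and the schedule enters through a convolution-type term over its own history. The functional form, exponents, and kernel below are assumptions chosen for illustration; the paper's actual law is derived from the SDE analysis.

```python
import numpy as np

# Schematic FSL-style surrogate (assumed functional form, not the paper's exact law):
# intrinsic time is the cumulative learning rate, the "signal" term decays as a power
# of intrinsic time, and the LRS enters through a convolution-type noise term.

def fsl_surrogate(lrs, L_inf=2.0, A=1.0, alpha=0.5, B=0.1, beta=0.5):
    """Predict a loss curve from a learning-rate schedule `lrs` (array of eta_t).

    All parameter names (L_inf, A, alpha, B, beta) are illustrative fitting
    constants; in practice they would be fitted to measured loss curves.
    """
    lrs = np.asarray(lrs, dtype=float)
    tau = np.cumsum(lrs)                       # intrinsic time: tau_t = sum_{s<=t} eta_s
    signal = A * tau ** (-alpha)               # power-law decay in intrinsic time
    # Convolution-type term: noise injected at step s (proportional to eta_s^2)
    # is damped according to how much intrinsic time has elapsed since s.
    noise = np.array([
        B * np.sum(lrs[: t + 1] ** 2 * (1.0 + tau[t] - tau[: t + 1]) ** (-beta))
        for t in range(len(lrs))
    ])
    return L_inf + signal + noise

# Example: compare a constant schedule with an exponentially decaying one.
T = 2000
constant = np.full(T, 1e-2)
exp_decay = 1e-2 * 0.999 ** np.arange(T)
loss_const, loss_decay = fsl_surrogate(constant), fsl_surrogate(exp_decay)
```

In this form, a schedule's effect on the loss is a weighted sum over its own past values, which is what makes fitting, predicting, and comparing schedules tractable.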
Problem

Research questions and friction points this paper is trying to address.

Existing scaling laws overlook loss dynamics and learning rate schedule impacts
The paper bridges this gap by modeling the evolution of population risk during training via Functional Scaling Laws (FSL)
It analyzes how different learning rate schedules affect training efficiency in data-limited and compute-limited regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Functional Scaling Law (FSL) to characterize the effect of general learning rate schedules
Uses an intrinsic-time viewpoint and stochastic differential equation (SDE) modeling of online SGD
Analyzes constant, exponential-decay, and warmup-stable-decay (WSD) schedules (sketched after this list)
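
For reference, the three schedules named above can be written as simple step-to-learning-rate maps; the peak value, warmup fraction, and decay fraction below are illustrative choices rather than the paper's settings.

```python
# The three schedules the paper analyzes, written as step -> learning-rate maps.
# Peak value, decay rate, and warmup/decay fractions are illustrative choices.

def constant_lr(t, T, peak=1e-2):
    return peak

def exponential_decay_lr(t, T, peak=1e-2, end=1e-4):
    # Decay geometrically from `peak` to `end` over T steps.
    return peak * (end / peak) ** (t / T)

def wsd_lr(t, T, peak=1e-2, warmup_frac=0.05, decay_frac=0.2):
    # Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay to 0.
    warmup_end = warmup_frac * T
    decay_start = (1.0 - decay_frac) * T
    if t < warmup_end:
        return peak * t / warmup_end
    if t < decay_start:
        return peak
    return peak * (T - t) / (T - decay_start)

T = 10_000
schedules = {
    "constant": [constant_lr(t, T) for t in range(T)],
    "exp_decay": [exponential_decay_lr(t, T) for t in range(T)],
    "wsd": [wsd_lr(t, T) for t in range(T)],
}
```

Any of these schedule arrays could be fed into a surrogate like the sketch after the abstract to compare predicted loss curves.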
Binghui Li
Center for Machine Learning Research, Peking University
Fengling Chen
MOE Key Laboratory of Bioinformatics, Department of Automation, Tsinghua University
statistical learning, bioinformatics
Zixun Huang
Shenzhen Polytechnic University, Hong Kong Polytechnic University
Medical Image Analysis, Medical Image Segmentation
Lean Wang
Peking University
Large Language Models
Lei Wu
Center for Machine Learning Research, Peking University; School of Mathematical Sciences, Peking University; AI for Science Institute, Beijing