Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing scaling laws focus solely on final loss, neglecting training dynamics and the impact of learning rate schedules (LRS). Method: We propose Functional Scaling Laws (FSL), the first framework to model training dynamics via an intrinsic-time viewpoint and stochastic differential equation (SDE) modeling of SGD, characterizing the effect of the LRS analytically through a convolution-type functional term. Leveraging a teacher-student kernel regression setup trained with online SGD, FSL captures the evolution of population risk throughout training. Contribution/Results: FSL provides a unified theoretical explanation for empirical phenomena, including why higher-capacity models are more data- and compute-efficient and why learning rate decay improves training efficiency. Experiments on models ranging from 0.1B to 1B parameters demonstrate high-fidelity fitting and accurate prediction of training loss curves. Crucially, FSL establishes an interpretable, principled foundation for LRS design, bridging theory and practice in deep learning optimization.
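
Below is a minimal sketch of the kind of teacher-student online SGD experiment the summary describes. The linear (identity-kernel) model, dimension, noise level, and learning rate are illustrative assumptions rather than the paper's exact kernel-regression setup; the sketch only shows how population risk can be tracked step by step as fresh data arrive online.

```python
import numpy as np

# Minimal teacher-student online SGD sketch (assumed linear/identity-kernel setup,
# not the paper's exact kernel-regression model): a student w tracks a fixed
# teacher w_star from streaming Gaussian data, recording population risk each step.

rng = np.random.default_rng(0)
d = 100                                     # ambient dimension (illustrative)
sigma = 0.1                                 # label-noise level (illustrative)
w_star = rng.normal(size=d) / np.sqrt(d)    # teacher parameters
w = np.zeros(d)                             # student, initialised at zero
eta = 1e-2                                  # constant learning rate for this sketch

risks = []
for t in range(5_000):
    x = rng.normal(size=d)                  # fresh sample each step -> online SGD
    y = x @ w_star + sigma * rng.normal()
    grad = (x @ w - y) * x                  # stochastic gradient of the squared loss
    w -= eta * grad
    # For isotropic Gaussian inputs the population risk is ||w - w_star||^2 / 2
    # plus the irreducible noise floor sigma^2 / 2.
    risks.append(0.5 * float(np.dot(w - w_star, w - w_star)) + 0.5 * sigma**2)
```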

📝 Abstract
Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). However, most existing works on scaling laws primarily focus on the final-step loss, overlooking the loss dynamics during the training process and, crucially, the impact of learning rate schedule (LRS). In this paper, we aim to bridge this gap by studying a teacher-student kernel regression setup trained via online stochastic gradient descent (SGD). Leveraging a novel intrinsic time viewpoint and stochastic differential equation (SDE) modeling of SGD, we introduce the Functional Scaling Law (FSL), which characterizes the evolution of population risk during the training process for general LRSs. Remarkably, the impact of the LRSs is captured through an explicit convolution-type functional term, making their effects fully tractable. To illustrate the utility of FSL, we analyze three widely used LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- under both data-limited and compute-limited regimes. We provide theoretical justification for widely adopted empirical practices in LLMs pre-training such as (i) higher-capacity models are more data- and compute-efficient; (ii) learning rate decay can improve training efficiency; (iii) WSD-like schedules can outperform direct-decay schedules. Lastly, we explore the practical relevance of FSL as a surrogate model for fitting, predicting and optimizing the loss curves in LLM pre-training, with experiments conducted across model sizes ranging from 0.1B to 1B parameters. We hope our FSL framework can deepen the understanding of LLM pre-training dynamics and provide insights for improving large-scale model training.
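
To make the abstract's description concrete, here is a schematic surrogate in the spirit of FSL: intrinsic time is taken as the cumulative learning rate, the deterministic part of the loss decays as a power of intrinsic time, and the schedule enters through a convolution-type term over its own history. The functional form, exponents, and kernel below are assumptions chosen for illustration; the paper's actual law is derived from the SDE analysis.

```python
import numpy as np

# Schematic FSL-style surrogate (assumed functional form, not the paper's exact law):
# intrinsic time is the cumulative learning rate, the "signal" term decays as a power
# of intrinsic time, and the LRS enters through a convolution-type noise term.

def fsl_surrogate(lrs, L_inf=2.0, A=1.0, alpha=0.5, B=0.1, beta=0.5):
    """Predict a loss curve from a learning-rate schedule `lrs` (array of eta_t).

    All parameter names (L_inf, A, alpha, B, beta) are illustrative fitting
    constants; in practice they would be fitted to measured loss curves.
    """
    lrs = np.asarray(lrs, dtype=float)
    tau = np.cumsum(lrs)                       # intrinsic time: tau_t = sum_{s<=t} eta_s
    signal = A * tau ** (-alpha)               # power-law decay in intrinsic time
    # Convolution-type term: noise injected at step s (proportional to eta_s^2)
    # is damped according to how much intrinsic time has elapsed since s.
    noise = np.array([
        B * np.sum(lrs[: t + 1] ** 2 * (1.0 + tau[t] - tau[: t + 1]) ** (-beta))
        for t in range(len(lrs))
    ])
    return L_inf + signal + noise

# Example: compare a constant schedule with an exponentially decaying one.
T = 2000
constant = np.full(T, 1e-2)
exp_decay = 1e-2 * 0.999 ** np.arange(T)
loss_const, loss_decay = fsl_surrogate(constant), fsl_surrogate(exp_decay)
```

In this form, a schedule's effect on the loss is a weighted sum over its own past values, which is what makes fitting, predicting, and comparing schedules tractable.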
Problem

Research questions and friction points this paper is trying to address.

Existing scaling laws overlook loss dynamics and learning rate schedule impacts
The paper bridges this gap by modeling the evolution of population risk during training via Functional Scaling Laws (FSL)
It analyzes how different learning rate schedules affect training efficiency in data-limited and compute-limited regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Functional Scaling Law (FSL) to characterize the effect of general learning rate schedules
Uses an intrinsic-time viewpoint and stochastic differential equation (SDE) modeling of online SGD
Analyzes constant, exponential-decay, and warmup-stable-decay (WSD) schedules (sketched after this list)
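
For reference, the three schedules named above can be written as simple step-to-learning-rate maps; the peak value, warmup fraction, and decay fraction below are illustrative choices rather than the paper's settings.

```python
# The three schedules the paper analyzes, written as step -> learning-rate maps.
# Peak value, decay rate, and warmup/decay fractions are illustrative choices.

def constant_lr(t, T, peak=1e-2):
    return peak

def exponential_decay_lr(t, T, peak=1e-2, end=1e-4):
    # Decay geometrically from `peak` to `end` over T steps.
    return peak * (end / peak) ** (t / T)

def wsd_lr(t, T, peak=1e-2, warmup_frac=0.05, decay_frac=0.2):
    # Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay to 0.
    warmup_end = warmup_frac * T
    decay_start = (1.0 - decay_frac) * T
    if t < warmup_end:
        return peak * t / warmup_end
    if t < decay_start:
        return peak
    return peak * (T - t) / (T - decay_start)

T = 10_000
schedules = {
    "constant": [constant_lr(t, T) for t in range(T)],
    "exp_decay": [exponential_decay_lr(t, T) for t in range(T)],
    "wsd": [wsd_lr(t, T) for t in range(T)],
}
```

Any of these schedule arrays could be fed into a surrogate like the sketch after the abstract to compare predicted loss curves.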
Binghui Li
Center for Machine Learning Research, Peking University
Fengling Chen
MOE Key Laboratory of Bioinformatics, Department of Automation, Tsinghua University
statistical learning, bioinformatics
Zixun Huang
Shenzhen Polytechnic University, Hong Kong Polytechnic University
Medical Image Analysis, Medical Image Segmentation
Lean Wang
Peking University
Large Language Models
Lei Wu
Center for Machine Learning Research, Peking University; School of Mathematical Sciences, Peking University; AI for Science Institute, Beijing