A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing studies lack a quantitative model of how learning rate schedules affect loss evolution during large language model pretraining. Method: We propose the Multi-Power Law, a unified analytical framework that models loss curves under canonical schedules (e.g., constant, cosine, step decay) via two decoupled components: a power law in the cumulative learning rate and an additional loss-reduction term induced by learning rate decay. This enables cross-schedule generalization and accurate prediction of unseen schedule shapes and durations from minimal fitting data. Contribution/Results: By minimizing the law's predicted final loss, we automatically discover a schedule resembling the recently proposed Warmup-Stable-Decay (WSD) schedule that achieves a lower final loss than the standard cosine schedule across diverse model scales and architectures. The law provides an interpretable, optimization-friendly tool for efficient pretraining, requiring only a few schedule-loss observations for high-fidelity loss curve prediction.
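
As a rough illustration of the decoupled form described above (a power law in the cumulative learning rate plus an extra reduction attributed to learning rate decay), here is a minimal sketch. The functional form, parameter names (L0, A, alpha, B, beta), and default values are illustrative assumptions, not the paper's exact formula; in practice the parameters are fitted to a few observed loss curves.

```python
import numpy as np

def predicted_loss(lrs, L0=2.5, A=0.5, alpha=0.4, B=300.0, beta=0.6):
    """Illustrative multi-power-law prediction of the loss after each step.

    lrs: per-step learning rates eta_1..eta_T (one training run's schedule).
    The baseline term is a power law in the cumulative learning rate; each
    learning rate drop adds a saturating power-law reduction over later steps.
    All parameter values here are placeholders to be fitted, not paper values.
    """
    lrs = np.asarray(lrs, dtype=float)
    T = len(lrs)
    S = np.cumsum(lrs)                                  # cumulative LR sum S(t)
    base = L0 + A * S ** (-alpha)                       # power law in the sum of LRs
    drops = np.concatenate(([0.0], np.maximum(lrs[:-1] - lrs[1:], 0.0)))
    reduction = np.zeros(T)
    for k in np.nonzero(drops)[0]:                      # every step where the LR dropped
        S_after = np.cumsum(lrs[k:])                    # LR mass accumulated since the drop
        reduction[k:] += B * drops[k] * (
            1.0 - (1.0 + S_after / (lrs[k] + 1e-12)) ** (-beta)
        )
    return base - reduction
```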

📝 Abstract
Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al., 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
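
To make the "fit on a few schedules, then predict unseen ones" workflow concrete, the hedged sketch below fits the illustrative parameters of predicted_loss (defined above) to measured loss curves from reference runs and then predicts the curve of a schedule it never saw. The Nelder-Mead optimizer, the mean-squared-error objective, and the particular reference schedules are assumptions standing in for the paper's actual fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize

def constant_schedule(T, peak=3e-4):
    return np.full(T, peak)

def cosine_schedule(T, peak=3e-4, floor=3e-5):
    t = np.arange(T)
    return floor + 0.5 * (peak - floor) * (1.0 + np.cos(np.pi * t / (T - 1)))

def fit_law(schedules, observed_losses):
    """Fit (L0, A, alpha, B, beta) to a few (schedule, measured loss curve) pairs."""
    def objective(theta):
        return sum(np.mean((predicted_loss(lrs, *theta) - losses) ** 2)
                   for lrs, losses in zip(schedules, observed_losses))
    theta0 = np.array([2.5, 0.5, 0.4, 300.0, 0.6])      # rough initial guess
    return minimize(objective, theta0, method="Nelder-Mead").x

# Example usage (losses_const / losses_cos are measured loss curves):
# theta = fit_law([constant_schedule(10_000), cosine_schedule(10_000)],
#                 [losses_const, losses_cos])
# step_lrs = np.where(np.arange(10_000) < 8_000, 3e-4, 3e-5)   # unseen step-decay run
# predicted_curve = predicted_loss(step_lrs, *theta)
```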
Problem

Research questions and friction points this paper is trying to address.

How does the pretraining loss evolve under different learning rate schedules, and can this evolution be predicted quantitatively?
Does a single multi-power law hold across model sizes and architectures?
Which learning rate schedule minimizes the final pretraining loss?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-power law predicts loss curves for unseen schedule shapes and training horizons
Validated across model sizes and architectures
Automatically discovers learning rate schedules that outperform the cosine baseline by minimizing predicted final loss (see the sketch after this list)
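
For the last bullet, one simple way to use the fitted law is to search a parameterized family of Warmup-Stable-Decay-style schedules and keep the one with the lowest predicted final loss. The linear-decay parameterization and grid search below are illustrative assumptions (reusing predicted_loss and the fitted parameters theta from the sketches above), not the paper's optimization method.

```python
import numpy as np

def wsd_schedule(T, decay_frac, peak=3e-4, floor=0.0):
    """Hold the peak LR, then decay linearly over the last decay_frac of steps
    (warmup omitted for brevity)."""
    decay_start = int(T * (1.0 - decay_frac))
    lrs = np.full(T, peak)
    tail = np.arange(T - decay_start, dtype=float)
    lrs[decay_start:] = peak + (floor - peak) * tail / max(T - decay_start - 1, 1)
    return lrs

def best_decay_fraction(T, theta, fracs=np.linspace(0.05, 0.5, 10)):
    """Pick the decay fraction whose predicted final loss is lowest."""
    finals = [predicted_loss(wsd_schedule(T, f), *theta)[-1] for f in fracs]
    return fracs[int(np.argmin(finals))]

# best = best_decay_fraction(10_000, theta)   # theta from fit_law above
# schedule = wsd_schedule(10_000, best)
```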
Authors

Kairong Luo
Department of Computer Science and Technology, Tsinghua University
Haodong Wen
Qian Xuesen College, Xi’an Jiaotong University
Shengding Hu
Tsinghua University
Zhenbo Sun
Department of Computer Science and Technology, Tsinghua University
Zhiyuan Liu
Department of Computer Science and Technology, Tsinghua University
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Kaifeng Lyu
Tsinghua University
Wenguang Chen
Department of Computer Science and Technology, Tsinghua University; Peng Cheng Laboratory