WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

📅 2025-07-23
🤖 AI Summary
This work addresses the complexity and limited generalization caused by reliance on learning rate (LR) decay in large language model pretraining. The proposed Warmup-Stable and Merge (WSM) framework recasts conventional LR decay as checkpoint averaging and establishes, for the first time, a theoretical connection between LR decay and model merging, revealing merge duration as a critical performance determinant. WSM employs warmup-stable training followed by checkpoint aggregation, enabling decay-free optimization that is compatible with mainstream optimizers and existing training pipelines. Empirical evaluation shows consistent improvements over the WSD baseline: +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. Crucially, these gains persist after fine-tuning, demonstrating robustness and practical utility.

📝 Abstract
Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay, and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
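The core idea above, emulating an LR-decay schedule as a weighted average of checkpoints saved during the stable phase, can be sketched as follows. Note this is a minimal illustration: the function names and the exact mapping from decay curve to merge weights are assumptions for exposition, not the paper's precise formulation.

```python
import numpy as np

def decay_to_merge_weights(num_ckpts: int, schedule: str = "cosine") -> np.ndarray:
    """Illustrative merge weights shaped like a decay curve over the merge
    duration. The true decay-to-weight mapping derived in the paper may
    differ; this only shows the averaging mechanism."""
    t = np.linspace(0.0, 1.0, num_ckpts)  # position within the merge window
    if schedule == "cosine":
        w = 0.5 * (1.0 + np.cos(np.pi * t))
    elif schedule == "linear":
        w = 1.0 - t
    elif schedule == "inv_sqrt":
        w = 1.0 / np.sqrt(1.0 + t * num_ckpts)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return w / w.sum()  # normalize so the merge is a convex combination

def merge_checkpoints(ckpts: list, weights: np.ndarray) -> dict:
    """Weighted average of parameter dicts (checkpoints collected over the
    merge duration); no decay phase is ever run."""
    merged = {}
    for name in ckpts[0]:
        merged[name] = sum(w * c[name] for w, c in zip(weights, ckpts))
    return merged
```

Because the merge is a post-hoc average, the merge duration (how wide a training window the checkpoints span) can be tuned after training, which the abstract identifies as the dominant hyperparameter.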
Problem

Research questions and friction points this paper is trying to address.

Reliance on LR decay complicates pretraining schedules and can limit generalization
Decay-free training and model merging have lacked a unifying theoretical account
It is unclear which merging hyperparameters (duration, checkpoint interval, merge quantity) drive performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decay-free learning via checkpoint merging
Unified model averaging for decay emulation
Merge duration as key performance factor