WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

📅 2025-07-23
🤖 AI Summary
This work addresses the complexity and limited generalization caused by reliance on learning rate (LR) decay in large language model pretraining. The proposed Warmup-Stable and Merge (WSM) framework recasts conventional LR decay as checkpoint averaging and establishes, for the first time, a theoretical connection between LR decay and model merging, revealing merge duration as a critical performance determinant. WSM employs warmup-stable training followed by checkpoint aggregation, enabling decay-free optimization that is compatible with mainstream optimizers and existing training pipelines. Empirical evaluation shows consistent improvements over the WSD baseline: +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. Crucially, these gains persist after fine-tuning, demonstrating robustness and practical utility.

📝 Abstract
Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay, and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
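The core idea above, emulating an LR-decay schedule as a weighted average of checkpoints saved during the stable phase, can be sketched as follows. Note this is a minimal illustration: the function names and the exact mapping from decay curve to merge weights are assumptions for exposition, not the paper's precise formulation.

```python
import numpy as np

def decay_to_merge_weights(num_ckpts: int, schedule: str = "cosine") -> np.ndarray:
    """Illustrative merge weights shaped like a decay curve over the merge
    duration. The true decay-to-weight mapping derived in the paper may
    differ; this only shows the averaging mechanism."""
    t = np.linspace(0.0, 1.0, num_ckpts)  # position within the merge window
    if schedule == "cosine":
        w = 0.5 * (1.0 + np.cos(np.pi * t))
    elif schedule == "linear":
        w = 1.0 - t
    elif schedule == "inv_sqrt":
        w = 1.0 / np.sqrt(1.0 + t * num_ckpts)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return w / w.sum()  # normalize so the merge is a convex combination

def merge_checkpoints(ckpts: list, weights: np.ndarray) -> dict:
    """Weighted average of parameter dicts (checkpoints collected over the
    merge duration); no decay phase is ever run."""
    merged = {}
    for name in ckpts[0]:
        merged[name] = sum(w * c[name] for w, c in zip(weights, ckpts))
    return merged
```

Because the merge is a post-hoc average, the merge duration (how wide a training window the checkpoints span) can be tuned after training, which the abstract identifies as the dominant hyperparameter.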
Problem

Research questions and friction points this paper is trying to address.

Reliance on LR decay complicates pretraining schedules and can limit generalization
Decay-free training and model merging have lacked a unifying theoretical account
It is unclear which merging hyperparameters (duration, checkpoint interval, merge quantity) drive performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decay-free learning via checkpoint merging
Unified model averaging for decay emulation
Merge duration as key performance factor