ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

📅 2026-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes ScheduleFree+, a learning-rate-free and schedule-free optimization method that scales Schedule-Free Learning to large language models, where strong results had previously been demonstrated only at small scales. The authors identify a number of fixes needed to make Schedule-Free Learning work at larger batch sizes and model sizes, removing the reliance on hand-tuned learning rate schedules. The resulting method greatly outperforms Warmup-Stable-Decay (WSD) schedules and is most effective for long-duration training: at a budget of 1000 tokens per parameter, it outperforms state-of-the-art schedules by 31%. The work also provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
📝 Abstract
Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long-duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
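For context, the averaging idea behind Schedule-Free Learning can be sketched with the base Schedule-Free SGD update from the earlier Schedule-Free work. This is only an illustration of that underlying method, not ScheduleFree+: the abstract does not spell out the scaling fixes or the learning-rate-free mechanism, so neither is reproduced here. The function name `schedule_free_sgd`, the toy quadratic objective, and the values of `lr`, `beta`, and `steps` are assumptions chosen for the example.

```python
def schedule_free_sgd(grad, z0, lr=1.0, beta=0.9, steps=200):
    """Base Schedule-Free SGD on a scalar parameter (illustrative sketch only).

    z is the plain SGD iterate, x is a uniform running average of z, and the
    gradient is evaluated at y, an interpolation between the two.
    """
    z = z0  # base sequence, updated with a constant step size
    x = z0  # averaged sequence, the point one would evaluate or checkpoint
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # gradient evaluation point
        z = z - lr * grad(y)             # constant learning rate, no schedule
        c = 1.0 / (t + 1)                # uniform averaging weight
        x = (1.0 - c) * x + c * z        # online average of the z iterates
    return x, z


# Hypothetical toy objective f(w) = 0.5 * (w - 3)^2, whose gradient is w - 3.
x_final, z_final = schedule_free_sgd(lambda w: w - 3.0, z0=0.0)
print(f"averaged iterate x = {x_final:.4f}")
```

Because `x` is maintained as a running average at every step, it can be read out at any point during training without a decay phase, which is the sense in which the abstract describes the method as "anytime" and connects it to model averaging.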
Problem

Research questions and friction points this paper is trying to address.

Schedule-Free Learning
Large Language Models
Learning Rate Scheduling
Scalability
Model Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schedule-Free Learning
Learning-Rate-Free
Large Language Models
Model Averaging
Checkpoint Merging