Anytime Training with Schedule-Free Spectral Optimization

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a limitation of conventional neural network training: its reliance on predefined learning-rate schedules tied to a fixed horizon, which causes strong path dependence, poor adaptability when data availability changes, and costly hyperparameter re-tuning. Existing schedule-free methods remove the schedule but still underperform carefully tuned baselines. The authors propose SF-NorMuon, a schedule-free spectral optimizer that applies weight decay at the fast iterate and yields high-quality models at any training duration. SF-NorMuon is reported to be the first schedule-free method to match or exceed well-tuned AdamW. On the theoretical side, the authors prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. With a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW across 1–8× Chinchilla-optimal training durations on both 125M and 772M parameter language models.
πŸ“ Abstract
Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
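To make the terms "schedule-free", "spectral", and "weight decay at the fast iterate" concrete, here is a minimal toy sketch. It is not the SF-NorMuon algorithm, which this summary does not spell out. It combines the standard schedule-free interpolation and averaging recursion with an orthogonalized (polar-factor) update computed by exact SVD, and it assumes that "fast iterate" refers to the base sequence `z`. Momentum, Newton-Schulz iterations, and NorMuon's normalization are omitted. The function names, the quadratic objective, and all hyperparameter values are hypothetical choices for illustration.

```python
import numpy as np


def orthogonalize(grad):
    """Return the polar factor U @ Vt of grad (all singular values set to 1)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def sf_spectral_demo(steps=500, lr=0.05, beta=0.9, weight_decay=0.01, seed=0):
    """Toy schedule-free spectral descent on 0.5 * ||W - target||_F^2.

    Returns (initial_loss, final_loss), both evaluated at the averaged iterate x.
    """
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((4, 3))

    def loss(w):
        return 0.5 * np.sum((w - target) ** 2)

    z = np.zeros_like(target)  # base iterate (assumed here to be the "fast" one)
    x = z.copy()               # averaged iterate, usable as a checkpoint at any step
    initial_loss = loss(x)

    for t in range(1, steps + 1):
        # Gradient is evaluated at an interpolation between z and x.
        y = (1.0 - beta) * z + beta * x
        grad = y - target
        # Constant step size: spectral direction plus weight decay applied to z.
        z = z - lr * (orthogonalize(grad) + weight_decay * z)
        # Uniform running average of z replaces a learning-rate schedule.
        c = 1.0 / t
        x = (1.0 - c) * x + c * z

    return initial_loss, loss(x)


if __name__ == "__main__":
    start, end = sf_spectral_demo()
    print(f"loss at averaged iterate: {start:.4f} -> {end:.6f}")
```

The step size stays constant throughout, and no training horizon appears anywhere in the loop. The averaged iterate `x` is what one would evaluate or checkpoint at an arbitrary stopping point, which is the "anytime" property the abstract describes.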
Problem

Research questions and friction points this paper is trying to address.

anytime training
schedule-free optimization
learning-rate schedules
horizon dependence
neural network training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schedule-Free Optimization
Spectral Optimizer
Anytime Training
Weight Decay
Continual Learning