AI Summary
Conventional neural network training relies on predefined learning-rate schedules tied to a fixed training horizon, which causes strong path dependence and costly hyperparameter re-tuning whenever data availability changes. Existing schedule-free methods remove the schedule but still underperform carefully tuned baselines. The authors propose SF-NorMuon, a schedule-free spectral optimizer that applies weight decay at the fast iterate and yields high-quality checkpoints at any point during training. According to the authors, it closes the performance gap between schedule-free methods and well-tuned AdamW. On the theoretical side, they prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. With a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across 1 to 8× Chinchilla-optimal training durations.
Abstract
Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
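For readers unfamiliar with the schedule-free idea referenced above, the following is a minimal sketch of the plain schedule-free SGD recursion on a toy quadratic. It is not SF-NorMuon: it has no spectral update and no weight decay, and the function name `schedule_free_sgd`, the parameters `lr` and `beta`, and the toy objective are all assumptions made for illustration. It only shows how a constant step size combined with iterate averaging can replace an explicit learning-rate schedule.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.5, beta=0.9, steps=500):
    """Plain schedule-free SGD sketch (illustrative; not the paper's SF-NorMuon)."""
    z = np.array(x0, dtype=float)  # base sequence, updated by gradient steps
    x = z.copy()                   # running average of z, the iterate one would evaluate
    for t in range(1, steps + 1):
        # The gradient is evaluated at an interpolation of the two sequences.
        y = (1.0 - beta) * z + beta * x
        z = z - lr * grad(y)       # constant step size, no schedule
        c = 1.0 / t                # uniform averaging weight
        x = (1.0 - c) * x + c * z
    return x

# Toy objective (an assumption): f(w) = 0.5 * ||w - target||^2, so grad f(w) = w - target.
target = np.array([1.0, -2.0])
x_final = schedule_free_sgd(lambda w: w - target, x0=np.zeros(2))
print(np.linalg.norm(x_final - target))
```

Because the averaged iterate `x` is usable at every step, training can be stopped at any point without having committed to a horizon in advance, which is the property the abstract refers to as "anytime" optimization.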