Anytime Training with Schedule-Free Spectral Optimization

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional neural network training relies on learning-rate schedules fixed in advance for a known horizon, which causes strong path dependence and costly re-tuning whenever data availability changes. Existing schedule-free methods remove the explicit schedule but still underperform carefully tuned baselines. The authors propose SF-NorMuon, a schedule-free spectral optimizer that applies weight decay at the fast iterate and yields high-quality checkpoints at any point during training. With a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across 1–8× Chinchilla horizons. On the theoretical side, the paper proves a stationarity guarantee for schedule-free spectral dynamics and identifies weight decay at the fast iterate as essential for long-horizon stability.
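To give a feel for what a 1× versus 8× Chinchilla horizon means in tokens, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 20 training tokens per parameter; the paper's exact definition of a Chinchilla horizon is not given on this page, so `TOKENS_PER_PARAM` and `token_budget` are illustrative names and values, not the authors'.

```python
# Assumption: ~20 tokens per parameter as the "1x Chinchilla" budget.
# This ratio is a common rule of thumb, not a figure taken from the paper.
TOKENS_PER_PARAM = 20


def token_budget(n_params, horizon_multiplier, tokens_per_param=TOKENS_PER_PARAM):
    """Return the training-token budget for a model at a given horizon multiple."""
    return n_params * tokens_per_param * horizon_multiplier


if __name__ == "__main__":
    # Model sizes and the 1x / 8x endpoints come from the summary above.
    for n_params in (125_000_000, 772_000_000):
        for multiplier in (1, 8):
            tokens = token_budget(n_params, multiplier)
            print(f"{n_params / 1e6:.0f}M params at {multiplier}x: {tokens / 1e9:.2f}B tokens")
```

Under this assumption, a schedule tuned for the 1× budget would have to be re-tuned for each longer budget, which is the re-tuning cost that an anytime optimizer aims to avoid.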

📝 Abstract

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
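The general idea of a schedule-free spectral update can be illustrated with a minimal sketch. This is not SF-NorMuon: the abstract does not give the update rule, so the code below combines the generic schedule-free recursion (a base sequence `z`, a running average `x`, and a gradient evaluated at their interpolation `y`) with a Muon-style orthogonalized gradient. Several things are assumptions made for illustration: the function names, the hyperparameter values, the use of an exact SVD in place of an iterative orthogonalization, and the reading of "fast iterate" as the base sequence `z`.

```python
import numpy as np


def orthogonalize(g):
    """Return the polar factor U V^T of g (all singular values set to 1).

    Assumption: an exact SVD stands in for the iterative orthogonalization
    that Muon-style optimizers typically use.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def sf_spectral_step(x, z, grad_fn, t, lr=0.02, beta=0.9, wd=0.0):
    """One schedule-free step with a spectral (orthogonalized) update.

    x: averaged iterate (the checkpoint one would evaluate at any time)
    z: base iterate, assumed here to be the "fast iterate"
    t: zero-based step index; no learning-rate schedule depends on a horizon
    """
    y = (1.0 - beta) * z + beta * x          # gradient is taken at the interpolation
    update = orthogonalize(grad_fn(y))
    z = z - lr * (update + wd * z)           # weight decay applied to z (assumption)
    c = 1.0 / (t + 1)                        # uniform averaging weight
    x = (1.0 - c) * x + c * z
    return x, z


def run_demo(steps=2000, wd=0.0, seed=0):
    """Fit a 4x4 matrix to a random target; return (initial_loss, final_loss) at x."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((4, 4))

    def loss(w):
        return 0.5 * np.sum((w - target) ** 2)

    def grad(w):
        return w - target

    x = np.zeros((4, 4))
    z = x.copy()
    initial = loss(x)
    for t in range(steps):
        x, z = sf_spectral_step(x, z, grad, t, wd=wd)
    return initial, loss(x)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss at averaged iterate: {initial:.4f} -> {final:.6f}")
```

Because the averaged iterate `x` is usable at every step, the loop can be stopped at any point without having committed to a horizon in advance. Where the decay term should act is exactly the design question the abstract highlights; the placement in this toy example is a guess, not the paper's result.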

Problem

Research questions and friction points this paper is trying to address.

anytime training

schedule-free optimization

learning-rate schedules

horizon dependence

neural network training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Schedule-Free Optimization

Spectral Optimizer

Anytime Training

Weight Decay

Continual Learning

🔎 Similar Papers

Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks

2024-05-24 · arXiv.org · Citations: 1

Spike No More: Stabilizing the Pre-training of Large Language Models

2023-12-28 · arXiv.org · Citations: 15
