AMUSE: Anytime Muon with Stable Gradient Evaluation

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses two weaknesses of the Muon optimizer: oscillation along high-curvature directions and reliance on a learning rate schedule. It proposes AMUSE, an optimizer that combines Muon's momentum orthogonalization with Schedule-Free iterate averaging. A time-varying interpolation coefficient sets where gradients are evaluated: near the fast Muon sequence early in training for rapid adaptation, then progressively closer to the stable averaged sequence to suppress noise-induced oscillations in steep directions. The method requires no learning rate schedule and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over AdamW, Schedule-Free AdamW, and Muon.
📝 Abstract
Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.
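The abstract does not give AMUSE's update rules, so the following is only a minimal sketch of how the three ingredients it names could fit together: an orthogonalized momentum step, a Schedule-Free running average, and a time-varying interpolation coefficient that moves the gradient evaluation point from the fast sequence toward the averaged one. Everything specific here is an assumption made for illustration: the function names (`orthogonalize`, `interpolation`, `amuse_like_step`), the schedule `beta_max * t / (t + warm)`, the hyperparameter values, and the use of a plain cubic Newton-Schulz iteration in place of whatever orthogonalization routine the paper uses. The averaging form follows the standard Schedule-Free recursion with uniform weights `1 / (t + 1)`.

```python
# Illustrative sketch only. The update rules, the beta schedule, and all
# hyperparameters are assumptions, not the AMUSE algorithm from the paper.
# Matrices are lists of lists of floats; the step counter t starts at 1.

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def combine(ca, a, cb, b):
    """Return ca * a + cb * b, elementwise."""
    return [[ca * p + cb * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def orthogonalize(m, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of m.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X after
    scaling by the Frobenius norm so that all singular values are at most 1.
    """
    norm = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in m]
    for _ in range(steps):
        x = combine(1.5, x, -0.5, matmul(matmul(x, transpose(x)), x))
    return x

def interpolation(t, beta_max=0.9, warm=10.0):
    """Hypothetical schedule: near 0 early (evaluate at the fast sequence z),
    rising toward beta_max (evaluate closer to the averaged sequence x)."""
    return beta_max * t / (t + warm)

def amuse_like_step(x, z, m, grad_fn, t, lr=0.1, mu=0.9):
    """One assumed step: x is the averaged sequence, z the fast sequence,
    m the momentum buffer, grad_fn maps a matrix to its gradient."""
    beta = interpolation(t)
    y = combine(1.0 - beta, z, beta, x)          # gradient evaluation point
    m = combine(mu, m, 1.0, grad_fn(y))          # heavy-ball momentum
    z = combine(1.0, z, -lr, orthogonalize(m))   # orthogonalized fast update
    c = 1.0 / (t + 1)
    x = combine(1.0 - c, x, c, z)                # uniform running average
    return x, z, m
```

A caller would initialize `x` and `z` to the same weights, set `m` to zeros, and call `amuse_like_step` with `t = 1, 2, ...` at a constant `lr`; the averaged sequence `x` is the one to evaluate at any stopping point. The cubic iteration is chosen here only because it is short and easy to verify, and it can need many iterations when the momentum matrix is ill-conditioned.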
Problem

Research questions and friction points this paper is trying to address.

optimization
learning rate schedule
gradient oscillation
loss landscape
anytime training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon
Schedule-Free Optimization
Loss Landscape
Anytime Training
Gradient Orthogonalization