🤖 AI Summary
Problem: Muon optimizers are empirically effective for training matrix-structured parameters, but the best known convergence rate for the standard variant, $\mathcal{O}(T^{-1/4})$, falls short of the lower bound for stochastic nonconvex optimization.
Method: We propose Muon-VR2, the first Muon-type optimizer with a provably optimal convergence rate. It adds a variance-reduction mechanism to Muon, which enables a rigorous analysis in stochastic nonconvex settings; the analysis also covers the Polyak–Łojasiewicz (PŁ) condition.
Contribution/Results: We establish the first $\tilde{\mathcal{O}}(T^{-1/3})$ convergence rate for a Muon variant under stochastic nonconvexity, breaking the prior theoretical bottleneck. This bridges the long-standing gap between Muon’s practical success and its theoretical foundations, and provides a principled design pathway for accelerated variants. Experiments on CIFAR-10 image classification and C4 language modeling confirm Muon-VR2’s superior per-iteration convergence over existing methods.
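A rough calculation makes the gap between the two rates concrete. It assumes the rate bounds a stationarity measure such as the expected gradient norm (the summary does not state the exact measure), and it ignores constants and logarithmic factors:

$$
T^{-1/4} \le \epsilon \iff T \ge \epsilon^{-4}, \qquad T^{-1/3} \le \epsilon \iff T \ge \epsilon^{-3}.
$$

For $\epsilon = 10^{-2}$, this is on the order of $10^{8}$ iterations versus $10^{6}$, a reduction by a factor of $\epsilon^{-1} = 100$.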
📝 Abstract
The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap persists between its practical performance and theoretical understanding. Existing analyses indicate that the standard Muon variant achieves only a suboptimal convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we construct and analyze a variance-reduced variant, termed Muon-VR2. We provide the first rigorous proof that incorporating a variance-reduction mechanism enables Muon-VR2 to attain an optimal convergence rate of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Moreover, our analysis establishes convergence guarantees for Muon variants under the Polyak-Łojasiewicz (PŁ) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work provides the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.
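The abstract does not spell out the Muon-VR2 update rule, so the following is only an illustrative sketch of what a variance-reduced Muon-style step could look like. It rests on two assumptions that are not taken from the paper: the variance-reduction mechanism is a STORM-type recursive gradient estimator, and the update direction is orthogonalized with an exact SVD (Muon implementations typically approximate this step with a Newton-Schulz iteration). The names `orthogonalize`, `muon_vr_step`, and `run_demo`, along with the learning rate and the mixing weight `a`, are hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Return the semi-orthogonal polar factor U @ Vt of the matrix m."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_vr_step(w, w_prev, d_prev, grad_fn, batch, lr=0.05, a=0.1):
    """One hypothetical variance-reduced Muon-style step.

    Assumption: a STORM-type estimator, which evaluates the stochastic
    gradient at the current and previous iterate on the SAME mini-batch.
    """
    g_new = grad_fn(w, batch)
    g_old = grad_fn(w_prev, batch)
    # Recursive estimator: fresh gradient plus a corrected carry-over term.
    d = g_new + (1.0 - a) * (d_prev - g_old)
    # Muon-style update: step along the orthogonalized direction.
    w_next = w - lr * orthogonalize(d)
    return w_next, d


def run_demo(steps=200, seed=0):
    """Run the sketch on a toy matrix least-squares problem."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(256, 8))
    w_true = rng.normal(size=(8, 4))
    y = x @ w_true

    def grad_fn(w, idx):
        xb, yb = x[idx], y[idx]
        return xb.T @ (xb @ w - yb) / len(idx)

    def loss(w):
        return 0.5 * np.mean(np.sum((x @ w - y) ** 2, axis=1))

    w = np.zeros((8, 4))
    w_prev = w
    # Initialize the estimator with a plain mini-batch gradient.
    d = grad_fn(w, rng.choice(256, size=32, replace=False))
    initial = loss(w)
    for _ in range(steps):
        idx = rng.choice(256, size=32, replace=False)
        w_next, d = muon_vr_step(w, w_prev, d, grad_fn, idx)
        w_prev, w = w, w_next
    return initial, loss(w)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss before: {initial:.4f}, loss after: {final:.4f}")
```

The extra gradient evaluation at the previous iterate is the price of this kind of estimator: each step costs two mini-batch gradients instead of one. That is consistent with the abstract framing its experimental claims in terms of per-iteration convergence.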