On the Convergence of Muon and Beyond

📅 2025-09-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Muon optimizers are empirically effective at training matrix-structured parameters, but existing analyses give the standard variant only an $\mathcal{O}(T^{-1/4})$ convergence rate in stochastic nonconvex settings, which falls short of the lower bound for this problem class. Method: We propose Muon-VR2, a variance-reduced Muon variant, and analyze it in the stochastic nonconvex setting; the same analysis also yields convergence guarantees for Muon variants under the Polyak–Łojasiewicz (PŁ) condition. Contribution/Results: We prove that Muon-VR2 attains an $\tilde{\mathcal{O}}(T^{-1/3})$ convergence rate, matching the lower bound and giving the first proof of optimality for a Muon-style optimizer. This narrows the gap between Muon's practical success and its theoretical foundations, and clarifies the path toward more practically efficient accelerated variants. Experiments on CIFAR-10 image classification and C4 language modeling corroborate the theoretical findings on per-iteration convergence.

📝 Abstract

The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap persists between its practical performance and theoretical understanding. Existing analyses indicate that the standard Muon variant achieves only a suboptimal convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we construct and analyze a variance-reduced variant, termed Muon-VR2. We provide the first rigorous proof that incorporating a variance-reduction mechanism enables Muon-VR2 to attain an optimal convergence rate of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Moreover, our analysis establishes convergence guarantees for Muon variants under the Polyak-Łojasiewicz (PŁ) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work provides the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.
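To make the idea of a variance-reduced Muon-style update concrete, here is a minimal NumPy sketch under loudly stated assumptions. The abstract does not give the Muon-VR2 update rule, so the STORM-style recursive gradient estimator below is a stand-in, not the paper's algorithm. The function names (`orthogonalize`, `vr_muon_step`, `run_demo`), the toy least-squares problem, and all hyperparameters are hypothetical. The orthogonalization uses an exact SVD for clarity; practical Muon implementations typically approximate this step with Newton-Schulz iterations.

```python
import numpy as np


def orthogonalize(m):
    """Replace every singular value of m with 1 (polar factor via SVD)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def vr_muon_step(w, w_prev, d_prev, grad_fn, batch, lr, a):
    """One step of a hypothetical variance-reduced Muon-style update.

    Assumption: a STORM-style recursive estimator stands in for the
    variance-reduction mechanism; this is not the paper's stated rule.
    """
    g_now = grad_fn(w, batch)       # stochastic gradient at the current iterate
    g_old = grad_fn(w_prev, batch)  # same minibatch, previous iterate
    d = g_now + (1.0 - a) * (d_prev - g_old)  # corrected gradient estimate
    return w - lr * orthogonalize(d), d


def run_demo(steps=300, lr=0.05, a=0.1, seed=0):
    """Toy matrix least squares: minimize 0.5 * mean ||x W - y||^2 over W."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(512, 4))
    y = x @ rng.normal(size=(4, 3))  # noiseless targets from a random W

    def loss(w):
        return 0.5 * np.mean(np.sum((x @ w - y) ** 2, axis=1))

    def grad_fn(w, idx):
        xb, yb = x[idx], y[idx]
        return xb.T @ (xb @ w - yb) / len(idx)

    w_prev = np.zeros((4, 3))
    initial = loss(w_prev)
    batch = rng.integers(0, len(x), size=32)
    d = grad_fn(w_prev, batch)  # initialize the estimator with a plain gradient
    w = w_prev - lr * orthogonalize(d)
    for _ in range(steps):
        batch = rng.integers(0, len(x), size=32)
        w_next, d = vr_muon_step(w, w_prev, d, grad_fn, batch, lr, a)
        w_prev, w = w, w_next
    return initial, loss(w)


if __name__ == "__main__":
    print(run_demo())
```

The key design point in this sketch is that both gradients in `vr_muon_step` are evaluated on the same minibatch, which is what lets the correction term cancel part of the sampling noise. The cost is two gradient evaluations per step, so a per-iteration comparison understates the per-gradient cost relative to plain Muon.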

Problem

Research questions and friction points this paper is trying to address.

Analyzing the standard Muon optimizer's suboptimal $\mathcal{O}(T^{-1/4})$ convergence rate

Proving optimal convergence for variance-reduced Muon-VR2 variant

Establishing convergence guarantees under Polyak-Łojasiewicz condition
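For reference, the PŁ condition in its standard textbook form is given below. The choice of the Frobenius norm for a matrix parameter $W$, the constant $\mu$, and the notation $f^{*}$ for the minimum value are assumptions here; the paper may state the condition with a different norm or normalization.

$$
\frac{1}{2}\,\lVert \nabla f(W) \rVert_F^{2} \;\ge\; \mu\,\bigl(f(W) - f^{*}\bigr) \quad \text{for all } W, \qquad \mu > 0 .
$$

The condition does not require convexity. It only asks that the gradient norm grows at least as fast as the suboptimality gap, which is why it is commonly used to obtain faster rates for nonconvex objectives.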

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variance-reduced Muon variant for optimization

Achieves the optimal convergence rate $\tilde{\mathcal{O}}(T^{-1/3})$, matching the lower bound

Convergence guarantees under Polyak-Łojasiewicz condition
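To give a feel for the difference between the two rates, the short sketch below converts a rate of the form $T^{-1/p}$ into the number of iterations needed to reach a target accuracy $\epsilon$. It ignores constants and the logarithmic factors hidden by the tilde, so the numbers are orders of magnitude only; `iterations_for` is a hypothetical helper, not something from the paper.

```python
def iterations_for(eps, p):
    """Iterations T such that T ** (-1 / p) equals eps, i.e. T = eps ** (-p).

    Constants and logarithmic factors are ignored (illustrative assumption).
    """
    return eps ** (-p)


if __name__ == "__main__":
    eps = 0.01
    t_standard = iterations_for(eps, 4)  # rate T^(-1/4): about 1e8 iterations
    t_vr = iterations_for(eps, 3)        # rate T^(-1/3): about 1e6 iterations
    print(f"{t_standard:.3g} vs {t_vr:.3g}, ratio {t_standard / t_vr:.3g}")
```

Under these simplifications the gap between the two rates is a factor of $1/\epsilon$ in iteration count, so it widens as the target accuracy tightens.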
