🤖 AI Summary
This work addresses a theoretical gap for the Muon optimizer: in practice, Muon orthogonalizes its momentum matrix with a few Newton–Schulz iterations, whereas prior analyses assume an exact polar factor computed via singular value decomposition (SVD). We establish convergence guarantees for Muon with a finite number $q$ of Newton–Schulz steps. Our analysis shows that the method converges to a stationary point at the same rate as the exact-SVD idealization, up to a constant factor that approaches 1 doubly exponentially in $q$ and improves with the degree of the Newton–Schulz polynomial. We also show that Muon removes the typical square-root-of-rank loss relative to its vector-based counterpart, SGD with momentum. Together, these results explain why a few low-degree iterations match SVD behavior in much less wall-clock time, narrowing the gap between the practical design of Muon and its theoretical understanding.
📝 Abstract
We analyze Muon as originally proposed and used in practice, that is, with the momentum orthogonalized by a few Newton-Schulz steps. Prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor for a given number $q$ of Newton-Schulz steps. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in $q$ and improves with the degree of the polynomial used in Newton-Schulz to approximate the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior in much less wall-clock time, and they quantify how much orthogonalizing the momentum matrix via Newton-Schulz gains over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing its practice-theory gap.
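To make the orthogonalization step concrete, the following is a minimal NumPy sketch that compares $q$ Newton-Schulz steps against the exact SVD polar factor. It rests on assumptions not taken from the paper: it uses the classical cubic iteration $X \leftarrow 1.5X - 0.5XX^\top X$ with Frobenius-norm scaling, which may differ from the polynomial and normalization the authors analyze, and the function names and the test matrix (singular values 3, 2, 1) are hypothetical choices for illustration only.

```python
import numpy as np


def newton_schulz_polar(G, q):
    """Approximate the polar factor of G with q cubic Newton-Schulz steps.

    Assumption: classical cubic polynomial, not necessarily the one
    analyzed in the paper or used in a given Muon implementation.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = G / np.linalg.norm(G)
    for _ in range(q):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def svd_polar(G):
    """Exact polar factor U V^T from a thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def make_matrix(seed=0):
    """Hypothetical 4x3 test matrix with singular values 3, 2, 1."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((4, 3)))
    V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return U @ np.diag([3.0, 2.0, 1.0]) @ V.T


def polar_error(G, q):
    """Frobenius distance between the q-step iterate and the SVD polar factor."""
    return np.linalg.norm(newton_schulz_polar(G, q) - svd_polar(G))


if __name__ == "__main__":
    G = make_matrix()
    for q in (1, 3, 5, 8, 12):
        print(f"q={q:2d}  error={polar_error(G, q):.3e}")
```

In this sketch each singular value of the iterate is driven toward 1, so the distance to the SVD polar factor shrinks with every step and collapses quickly once the singular values are close to 1. This mirrors, informally, the statement that a few steps suffice; it does not reproduce the paper's constants or its convergence proof.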