Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

📅 2026-05-07

🤖 AI Summary
This work addresses three gaps in the theory of matrix optimization: vector-style optimizers ignore the geometric structure of matrix parameters, stochastic gradients may be heavy-tailed, and existing analyses of Muon do not cover the Nesterov momentum and inexact polar decomposition used in practice. Muon updates weights along the polar factor of a momentum matrix; the paper develops a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed stochastic gradients. Within a unified inexact polar decomposition framework that covers iterative approximations such as Newton–Schulz, it establishes an optimal iteration and sample complexity of $O(\varepsilon^{-(3\alpha-2)/(\alpha-1)})$ for finding an $\varepsilon$-stationary point, where $\alpha\in(1,2]$ is the heavy-tail index. For the inexact-polar setting with $\sigma_1=0$, the guarantees do not require prior knowledge of $\alpha$. The paper also analyzes a randomized low-rank polar decomposition that is substantially more efficient than full-space methods, and numerical experiments support the inexact and randomized variants.
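To make the update described above concrete, here is a minimal NumPy sketch of one Muon-style step with Nesterov momentum and an inexact polar factor. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the hyperparameter defaults, the particular Nesterov form, and the use of the classical cubic Newton–Schulz iteration are all choices made for this example.

```python
import numpy as np


def newton_schulz_polar(M, num_iters=30):
    """Approximate the polar factor U V^T of M = U S V^T.

    Uses the classical cubic Newton-Schulz iteration (an assumption of this
    sketch; the paper's framework covers inexact polar routines in general).
    """
    # Frobenius scaling makes every singular value at most 1, which lies
    # inside the convergence region of the cubic iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(num_iters):
        # Each pass pushes the nonzero singular values of X toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def muon_nesterov_step(W, grad, buf, lr=0.02, beta=0.9, num_iters=5):
    """One illustrative Muon step (hypothetical helper, assumed defaults)."""
    # Exponential moving average of stochastic gradients (momentum buffer).
    buf = beta * buf + (1.0 - beta) * grad
    # One common Nesterov form: blend the fresh gradient into the buffer again.
    lookahead = beta * buf + (1.0 - beta) * grad
    # Few iterations means the polar factor is only approximate (inexact).
    direction = newton_schulz_polar(lookahead, num_iters)
    return W - lr * direction, buf


# Usage on random data, purely for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
buf = np.zeros_like(W)
grad = rng.standard_normal((6, 4))
W, buf = muon_nesterov_step(W, grad, buf)
```

Lowering `num_iters` makes each step cheaper but the polar factor less accurate, which is the kind of approximation error the inexact polar decomposition framework is meant to account for.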
📝 Abstract
Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O\left(\varepsilon^{-\frac{3\alpha-2}{\alpha-1}}\right)$ for finding an $\varepsilon$-stationary point, where $\alpha\in(1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $\sigma_1=0$, we also provide guarantees that do not require prior knowledge of $\alpha$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
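As a reading aid for the rate stated in the abstract, the exponent can be split into a constant part and a tail-dependent part. This is plain arithmetic on the stated exponent, not an additional result from the paper, and the remark about bounded variance assumes the common convention that $\alpha = 2$ corresponds to noise with finite variance.

$$
\frac{3\alpha-2}{\alpha-1} = \frac{3(\alpha-1)+1}{\alpha-1} = 3 + \frac{1}{\alpha-1}, \qquad \alpha\in(1,2].
$$

$$
\alpha = 2:\ O\left(\varepsilon^{-4}\right), \qquad \alpha = \tfrac{3}{2}:\ O\left(\varepsilon^{-5}\right), \qquad \alpha \to 1^{+}:\ \text{the exponent grows without bound}.
$$

The bound is therefore mildest at $\alpha = 2$ and degrades steadily as the noise tails become heavier.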
Problem

Research questions and friction points this paper is trying to address.

matrix optimization
heavy-tailed noise
polar decomposition
Nesterov momentum
non-convex optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon optimizer
polar decomposition
heavy-tailed noise
Nesterov momentum
randomized low-rank approximation
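For the randomized low-rank item, the following NumPy sketch shows one generic way to approximate a polar factor in a low-dimensional subspace: a Gaussian range finder followed by an exact polar factor of the small projected matrix. This is an assumption-laden illustration, not the paper's procedure; the function name `randomized_lowrank_polar` and the parameters `rank`, `oversample`, and `seed` are hypothetical.

```python
import numpy as np


def randomized_lowrank_polar(M, rank, oversample=4, seed=0):
    """Rank-limited approximation of the polar factor of M (illustrative).

    Generic randomized range-finder sketch; it is not claimed to match the
    randomized scheme analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    # Sketch width: target rank plus a little oversampling, capped by the shape.
    k = min(rank + oversample, min(M.shape))
    # A Gaussian test matrix samples the column space of M.
    omega = rng.standard_normal((M.shape[1], k))
    # Orthonormal basis Q for the sampled range (Q is m x k).
    Q, _ = np.linalg.qr(M @ omega)
    # Project M onto that basis; B is only k x n.
    B = Q.T @ M
    # Exact SVD of the small matrix, truncated to the requested rank.
    U_b, _, Vt = np.linalg.svd(B, full_matrices=False)
    # Lift back: Q U_b V^T is a partial isometry approximating the polar factor.
    return Q @ U_b[:, :rank] @ Vt[:rank, :]


# Usage on a matrix of exact rank 2, purely for illustration.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
P = randomized_lowrank_polar(M, rank=2)
```

The expensive factorization acts on a `k x n` matrix rather than the full `m x n` one, which is where the savings over a full-space polar decomposition come from when `k` is much smaller than `m`.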