Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning

๐Ÿ“… 2026-03-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the limitations of the Muon optimizer, whose isotropic updates fail to account for the highly ill-conditioned geometry and heavy-tailed curvature spectrum of deep neural network loss landscapes, leading to instability in high-curvature directions and slow convergence along flat ones. To overcome this, we propose Mousse, the first optimizer that integrates Kronecker-factored whitening with spectrally constrained optimization. Within the whitened coordinate system established by Shampoo, Mousse solves a curvature-aware anisotropic trust-region subproblem on the Stiefel manifold via Newtonโ€“Schulz iteration and polar decomposition, yielding geometrically adaptive orthogonal gradient updates. Experiments on language models ranging from 160M to 800M parameters demonstrate that Mousse reduces training steps by approximately 12% compared to Muon, with negligible additional computational overhead.
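The pipeline described above (Shampoo-style whitening followed by Newton–Schulz orthogonalization) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the simple 1.5/-0.5 Newton–Schulz coefficients, and the $-1/4$ exponent on the Kronecker factors are all assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the polar factor of G (its nearest orthogonal matrix) with a
    cubic Newton-Schulz iteration, as in Muon-style optimizers. The plain
    1.5/-0.5 coefficients are an assumption; production variants tune them."""
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius scaling bounds singular values by 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # pushes every singular value toward 1
    return X

def mousse_update(M, L, R, eps=1e-8):
    """Hypothetical sketch of one Mousse step: whiten the momentum matrix M with
    Shampoo's Kronecker-factored statistics L ~ E[G G^T] and R ~ E[G^T G],
    then orthogonalize in the whitened coordinate system."""
    def inv_quarter(A):
        # A^{-1/4} via eigendecomposition of the SPD statistic, clamped for stability
        w, Q = np.linalg.eigh(A)
        return Q @ np.diag(np.maximum(w, eps) ** -0.25) @ Q.T
    G_white = inv_quarter(L) @ M @ inv_quarter(R)  # whitened coordinates
    return newton_schulz_orthogonalize(G_white)    # polar factor on the Stiefel manifold
```

With `L = R = I` the whitening is a no-op and the step reduces to a Muon-style orthogonalized update, which matches the abstract's claim that Mousse generalizes Muon's spectral step.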


๐Ÿ“ Abstract
Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this "egalitarian" constraint is suboptimal for deep neural networks, whose curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose \textbf{Mousse} (\textbf{M}uon \textbf{O}ptimization \textbf{U}tilizing \textbf{S}hampoo's \textbf{S}tructural \textbf{E}stimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, achieving a $\sim$12\% reduction in training steps with negligible computational overhead.
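One plausible formalization of the anisotropic trust-region subproblem sketched in the abstract, with notation of my own choosing ($L$, $R$ for the Kronecker-factored statistics, $\eta$ for the trust-region radius) rather than the paper's:

```latex
\[
\Delta^{\star} \;=\; \arg\min_{\Delta}\; \langle G, \Delta \rangle
\quad \text{s.t.} \quad \bigl\| L^{1/4}\, \Delta\, R^{1/4} \bigr\|_{2} \le \eta .
\]
% Substituting the whitened variable \tilde{\Delta} = L^{1/4} \Delta R^{1/4}
% turns the objective into \langle \tilde{G}, \tilde{\Delta} \rangle with
% whitened gradient \tilde{G} = L^{-1/4} G R^{-1/4}. With the SVD
% \tilde{G} = U \Sigma V^{\top}, spectral-norm duality gives the polar factor:
\[
\Delta^{\star} \;=\; -\,\eta\; L^{-1/4}\, U V^{\top} R^{-1/4},
\]
```

so when $L = R = I$ the whitening vanishes and the step recovers Muon's orthogonalized update $-\eta\, U V^{\top}$.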
Problem

Research questions and friction points this paper is trying to address.

Stiefel manifold
curvature spectrum
ill-conditioned optimization
spectral optimization
anisotropic landscape
Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral optimization
curvature-aware preconditioning
anisotropic trust region
polar decomposition
Kronecker-factored statistics