🤖 AI Summary
This work investigates when the stochastic spectral optimizer Muon (which approximates SignSVD) performs differently from SignSGD under varying data covariance structures. Analyzing a high-dimensional matrix least-squares problem, the authors derive deterministic dynamics showing that, at large batch sizes, SignSVD acts as a square-root preconditioner on the data covariance spectrum, whereas SignSGD performs no preconditioning for generic covariance. Under a power-law covariance model, the relative performance of the two optimizers falls into three phases determined by the data spectral exponent α and the target spectral exponent β: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and one where the two trade off. The analysis characterizes when SignSVD outperforms SignSGD on anisotropic data, explains why their optimal learning rates differ, and offers a principled basis for choosing between them.
📝 Abstract
Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour, with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size, smaller eigenmodes behave like SGD, slowing down convergence. By contrast, SignSGD performs no preconditioning for generic covariance and has no such transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power-law covariance model with data exponent $\alpha$ and target exponent $\beta$ shows there are three phases in the $(\alpha,\beta)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.
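To make the two update rules concrete, here is a minimal NumPy sketch of stochastic SignSVD and SignSGD on a matrix least-squares problem. It is an illustration under stated assumptions, not the authors' experimental setup: the diagonal covariance with eigenvalues $j^{-\alpha}$, the Gaussian inputs, the unstructured random target (the target exponent $\beta$ is not modelled here), and all function names and default hyperparameters are hypothetical choices. SignSVD is computed with an exact SVD, whereas Muon only approximates that step.

```python
import numpy as np

def sign_svd(G):
    """SignSVD direction: set every singular value of G to 1 (G = U S V^T -> U V^T)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def sign_sgd(G):
    """SignSGD direction: entrywise sign of the gradient."""
    return np.sign(G)

def power_law_sqrt_eigs(d, alpha):
    # Assumption: diagonal data covariance with eigenvalues j^{-alpha}, j = 1..d.
    return np.arange(1, d + 1, dtype=float) ** (-alpha / 2.0)

def minibatch_gradient(W, W_star, sqrt_eigs, batch_size, rng):
    # Rows of X are drawn from N(0, diag(sqrt_eigs**2)); targets are noiseless X @ W_star.
    d = W.shape[0]
    X = rng.standard_normal((batch_size, d)) * sqrt_eigs
    residual = X @ (W - W_star)
    return X.T @ residual / batch_size

def run(update, steps=200, d=64, m=32, alpha=1.0, batch_size=256, lr=0.01, seed=0):
    """Run `steps` stochastic updates and return the population loss after each one."""
    rng = np.random.default_rng(seed)
    sqrt_eigs = power_law_sqrt_eigs(d, alpha)
    eigs = sqrt_eigs ** 2
    W_star = rng.standard_normal((d, m)) / np.sqrt(d)  # hypothetical target
    W = np.zeros((d, m))
    losses = []
    for _ in range(steps):
        G = minibatch_gradient(W, W_star, sqrt_eigs, batch_size, rng)
        W = W - lr * update(G)
        # Population loss 0.5 * tr((W - W*)^T K (W - W*)) for diagonal K.
        losses.append(0.5 * np.sum(eigs[:, None] * (W - W_star) ** 2))
    return losses

if __name__ == "__main__":
    for name, update in [("SignSVD", sign_svd), ("SignSGD", sign_sgd)]:
        losses = run(update)
        print(name, "final population loss:", losses[-1])
```

Varying `alpha`, `batch_size`, and `lr` in `run` gives a rough way to probe the qualitative claims of the abstract, such as the two methods preferring different learning rates on anisotropic data. The numbers it prints depend on these arbitrary settings and should not be read as reproducing the paper's results.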