Muon is Not That Special: Random or Inverted Spectra Work Just as Well

📅 2026-05-11

🤖 AI Summary
This work investigates whether the performance advantage of non-Euclidean optimizers such as Muon stems from their geometric structure. To this end, it introduces two new optimizer families: Freon, which interpolates smoothly between SGD and Muon, and Kaon, which discards geometric structure entirely and replaces singular values with random noise. Using Schatten quasi-norms, QDWH iterations, singular value replacement, and random feature models, the authors find that peak performance occurs in the quasi-norm regime and that Kaon, despite lacking explicit geometry, matches Muon's performance. These findings suggest that Muon's success arises not from faithfully tracking a global geometry but from favorable local alignment, descent potential, and step-size optimality, challenging the conventional view that non-Euclidean optimization fundamentally depends on geometric structure.
📝 Abstract
The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods and by linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate that it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
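The idea of keeping a gradient's singular vectors while changing its singular values can be illustrated with a small NumPy sketch. This is not the paper's implementation: it uses an explicit SVD for clarity, whereas the abstract describes a QDWH-based iterative approximation. The function names (`spectral_update`, `power_spectrum`, `random_spectrum`), the exponent `alpha`, and the uniform noise range are all hypothetical choices made for illustration; in particular, `alpha` is a generic exponent on the singular values and is not claimed to match the paper's Schatten parametrization.

```python
import numpy as np

def spectral_update(grad, spectrum_fn):
    """Keep the singular vectors of `grad` and replace its singular values."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    # Scaling the columns of U by the new spectrum is U @ diag(new_s).
    return (U * spectrum_fn(s)) @ Vt

def power_spectrum(alpha, eps=1e-12):
    # Hypothetical knob: alpha = 1 keeps the gradient (SGD-like direction),
    # alpha = 0 flattens the spectrum (Muon-like orthogonalized direction),
    # alpha < 0 inverts the ordering of the singular values.
    return lambda s: np.maximum(s, eps) ** alpha

def random_spectrum(rng, low=0.5, high=1.5):
    # Kaon-like idea: ignore the true singular values and draw positive noise.
    # The uniform range is an assumption made for this sketch.
    return lambda s: rng.uniform(low, high, size=s.shape)

def inner(A, B):
    """Frobenius inner product <A, B>."""
    return float(np.sum(A * B))

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))  # stand-in for a weight-matrix gradient

sgd_like = spectral_update(G, power_spectrum(1.0))
muon_like = spectral_update(G, power_spectrum(0.0))
inverted = spectral_update(G, power_spectrum(-0.5))
kaon_like = spectral_update(G, random_spectrum(rng))

for name, D in [("sgd", sgd_like), ("muon", muon_like),
                ("inverted", inverted), ("kaon", kaon_like)]:
    # <G, D> equals the sum of old singular values times new ones,
    # so it is positive whenever the new spectrum is positive.
    print(name, inner(G, D) > 0)
```

Because every variant shares the gradient's singular vectors, the inner product between the gradient and the update reduces to a sum of products of old and new singular values. Any strictly positive spectrum, including a random one, therefore yields a direction that is positively correlated with the gradient, which is one simple way to see why such updates can still make progress when the step size is chosen appropriately.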
Problem

Research questions and friction points this paper is trying to address.

non-Euclidean optimization
geometric structure
optimization performance
second-order methods
linear minimization oracle
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-Euclidean optimization
Schatten quasi-norms
random singular values
step-size optimality
linear minimization oracle