Muon is Not That Special: Random or Inverted Spectra Work Just as Well

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether the performance advantage of non-Euclidean optimizers such as Muon stems from their geometric structure. To this end, we introduce two new optimizer families: Freon, which interpolates smoothly between SGD and Muon, and Kaon, which discards geometric structure entirely and replaces singular values with random noise. Through analyses based on Schatten quasi-norms, QDWH iterations, singular value replacement, and random feature models, we find that peak performance occurs in the quasi-norm regime and that Kaon, despite lacking explicit geometry, matches Muon's performance. These findings suggest that Muon's success arises not from faithful tracking of a global geometry but from favorable local alignment, descent potential, and step-size optimality, which challenges the conventional view that non-Euclidean optimization fundamentally depends on geometric structure.

📝 Abstract

The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods and by linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate that performance is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
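
The idea of reshaping a gradient's singular values can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it uses an exact SVD rather than the QDWH-based iteration, and the names `spectral_update`, `power_spectrum`, `random_spectrum`, the exponent `alpha`, and the uniform noise range are all hypothetical choices made for illustration. The abstract does not state how Freon's Schatten parameter maps to such an exponent, nor which noise distribution Kaon uses.

```python
import numpy as np


def spectral_update(grad, transform):
    """Keep the singular vectors of `grad`, replace its singular values.

    With the thin SVD grad = U diag(s) Vt, returns U diag(transform(s)) Vt.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # Broadcasting scales column i of U by the i-th new singular value.
    return (u * transform(s)) @ vt


def power_spectrum(alpha, eps=1e-12):
    # Hypothetical parametrization (an assumption, not the paper's):
    #   alpha = 1  -> singular values unchanged (plain gradient, SGD-like)
    #   alpha = 0  -> all singular values set to 1 (orthogonalized, Muon-like)
    #   alpha < 0  -> spectrum inverted (small singular values amplified)
    return lambda s: np.maximum(s, eps) ** alpha


def random_spectrum(rng, low=0.5, high=1.5):
    # Kaon-style stand-in: positive random values; the range is an assumption.
    return lambda s: rng.uniform(low, high, size=s.shape)


rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))

sgd_like = spectral_update(grad, power_spectrum(1.0))
muon_like = spectral_update(grad, power_spectrum(0.0))
inverted = spectral_update(grad, power_spectrum(-0.5))
random_like = spectral_update(grad, random_spectrum(rng))

# Inner product <grad, update> equals sum_i s_i * f(s_i), so it is positive
# whenever the replacement singular values are positive.
alignments = [float(np.sum(grad * d))
              for d in (sgd_like, muon_like, inverted, random_like)]
```

In this sketch every variant, including the random one, has a positive inner product with the gradient, because the singular vectors are preserved and only positive singular values are substituted. That is one simple way to see why a random spectrum can still yield a descent direction; it says nothing about the step-size analysis the abstract describes.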

Problem

Research questions and friction points this paper is trying to address.

non-Euclidean optimization

geometric structure

optimization performance

second-order methods

linear minimization oracle

Innovation

Methods, ideas, or system contributions that make the work stand out.

non-Euclidean optimization

Schatten quasi-norms

random singular values