An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants

📅 2025-10-10
🤖 AI Summary
This work frames neural-network optimization as non-Euclidean gradient descent: a steepest-descent method defined by a per-layer norm, a rule for aggregating those norms across layers, and an optional normalization. Within this framework it formalizes Adam and the recently proposed Muon as instances of non-Euclidean gradient descent, derives new Muon variants including MuonMax, and shows how to combine any such method with model-based momentum (Momo). Experiments show that Muon is sensitive to the choice of learning rate while MuonMax is markedly more robust, and that the Momo variants further reduce sensitivity to hyperparameter tuning while often achieving better validation scores. The core contribution is a systematic framework for non-Euclidean gradient descent, together with the practical recommendation of MuonMax+Momo as a default for new tasks where optimal hyperparameters are unknown.

📝 Abstract
To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of non-Euclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model-based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus, for new tasks where the optimal hyperparameters are not known, we advocate for using Momo in combination with MuonMax to save on costly hyperparameter tuning.
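To make the abstract's framing concrete, here is a minimal NumPy sketch of one layer-wise non-Euclidean steepest-descent step in the Muon style: each 2-D weight matrix moves in the direction of its approximately orthogonalized gradient, which corresponds to steepest descent under a spectral-norm geometry. The function names, the learning rate, and the use of the commonly cited quintic Newton-Schulz coefficients are illustrative assumptions, not code from the paper.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a gradient matrix (Muon-style).

    Applies the quintic iteration X <- a*X + (b*(X X^T) + c*(X X^T)^2) X
    after Frobenius-normalizing the input, which pushes all singular
    values of X toward 1. Coefficients below are the commonly cited
    Newton-Schulz constants; they are an assumption here, not taken
    from this paper.
    """
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm for 2-D input
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_style_step(params, grads, lr=0.02):
    """One non-Euclidean steepest-descent step: every 2-D layer moves
    along its orthogonalized gradient, sharing a single learning rate.
    A hypothetical sketch; aggregation across layers and normalization
    (the choices the paper explores) are omitted for brevity."""
    return [W - lr * newton_schulz_orthogonalize(G)
            for W, G in zip(params, grads)]
```

The sketch isolates the per-layer norm choice; the paper's contribution is precisely in how such per-layer updates are aggregated and normalized across layers, which this toy version fixes to the simplest option.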
Problem

Research questions and friction points this paper is trying to address.

How should per-layer norms be aggregated across layers to define a non-Euclidean gradient descent method?
Why is Muon sensitive to the choice of learning rate, and can a more robust variant be derived?
Can model-based momentum be combined with non-Euclidean methods to reduce costly hyperparameter tuning?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a unified non-Euclidean gradient descent framework spanning Adam and Muon
Introduces MuonMax, a Muon variant that is significantly more robust to the learning rate
Combines Momo model-based momentum with Muon variants for hyperparameter robustness
Michael Crawshaw
George Mason University
Machine learning · Optimization · Deep learning · Federated learning
Chirag Modi
Center for Cosmology and Particle Physics, New York University
Mingrui Liu
Department of Computer Science, George Mason University
Robert M. Gower
Center for Computational Mathematics, Flatiron Institute