To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

📅 2026-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether the Muon optimizer, while accelerating training, compromises generalization or exacerbates overfitting to spurious features due to its lack of a simplicity bias. Through theoretical analysis, modeling of optimizer dynamics, and controlled experiments, the work systematically reveals, for the first time, key differences between Muon and traditional optimizers such as SGD in their learning trajectories and solution structures. Specifically, the absence of an implicit simplicity bias in Muon hinders the model's ability to identify structures shared across tasks and makes it more prone to fitting spurious correlations. The findings underscore that inductive biases inherent in optimizer design play a decisive role in generalization, offering theoretical grounding for selecting optimizers that are both efficient and robust.

📝 Abstract
For a long time, Adam has served as the ubiquitous default choice for training deep neural networks. Recently, many new optimizers have been introduced, of which Muon has perhaps gained the highest popularity due to its superior training speed. While many papers set out to validate the benefits of Muon, our paper investigates the potential downsides stemming from the mechanism driving this speedup. We explore the biases induced when optimizing with Muon, providing theoretical analysis of their consequences for the learning trajectories and the solutions learned. While the theory does provide justification for the benefits Muon brings, it also guides our intuition in constructing a couple of examples where Muon-optimized models are at a disadvantage. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more thoroughly studied methods like Stochastic Gradient Descent (SGD). We take first steps toward understanding the consequences this may have: Muon might struggle to uncover common underlying structure across tasks, and be more prone to fitting spurious features. More broadly, this paper should serve as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, as these biases can fundamentally change a model's behavior -- for better or for worse.
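As context for the mechanism the abstract alludes to: Muon replaces the raw momentum update of a weight matrix with an approximately semi-orthogonal matrix, computed via a Newton-Schulz iteration. This flattens the update's singular values toward 1, which is why the low-rank "simplicity bias" of SGD-style updates does not survive. A minimal NumPy sketch under stated assumptions: the function names and the simplified momentum rule are ours, the quintic coefficients follow the public Muon write-up, and the reference implementation additionally uses Nesterov-style momentum and handles tall matrices by transposing.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to the nearest semi-orthogonal matrix
    (U V^T from its SVD), flattening all singular values toward 1.
    Assumes G has at least as many columns as rows."""
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the Muon write-up
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_step(W, G, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update: momentum on raw gradients,
    then orthogonalize the accumulated update before applying it."""
    momentum = beta * momentum + G
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum
```

Because every singular direction of the update is rescaled to roughly unit strength, weak directions are amplified rather than suppressed, which is the source of both the speedup and the lost simplicity bias the paper analyzes.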
Problem

Research questions and friction points this paper is trying to address.

simplicity bias
optimizer
Muon
generalization
spurious features
Innovation

Methods, ideas, or system contributions that make the work stand out.

simplicity bias
optimizer bias
Muon
inductive bias
deep learning optimization
Sara Dragutinović
Courant Institute School of Mathematics, Computing, and Data Science, New York University
Rajesh Ranganath
Assistant Professor, NYU
Machine Learning · Statistics · Medical Informatics