Muon Does Not Converge on Convex Lipschitz Functions

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

This work questions whether the success of the Muon optimizer can be explained by classical convex Lipschitz theory. We establish, for the first time, that without smoothness assumptions, Muon fails to converge on convex Lipschitz functions under any learning rate schedule. Although incorporating error feedback restores theoretical convergence guarantees, it degrades empirical performance. Through a combination of convex analysis, non-Euclidean subgradient methods, and empirical evaluation on CIFAR-10 image classification and nanoGPT language modeling tasks, we demonstrate that Muon’s effectiveness likely hinges on structural properties—such as smoothness—not captured by existing convergence theory. These findings challenge the explanatory power of current optimization theory in accounting for the behavior of practical optimizers.

📝 Abstract

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.

Problem

Research questions and friction points this paper is trying to address.

Muon

convergence

convex Lipschitz functions

optimization

deep learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon

convex Lipschitz functions

non-convergence

error feedback

momentum methods

🔎 Similar Papers

Learning truly monotone operators with applications to nonlinear inverse problems

2024-03-30arXiv.orgCitations: 0

💼 Related Jobs

Research Engineer, Monetization AI