🤖 AI Summary
This work questions whether the success of the Muon optimizer can be explained by classical convex Lipschitz theory. We establish, for the first time, that without smoothness assumptions, Muon fails to converge on convex Lipschitz functions under any learning rate schedule. Incorporating error feedback restores theoretical convergence guarantees but degrades empirical performance. Combining convex analysis of non-Euclidean subgradient methods with empirical evaluation on CIFAR-10 image classification and nanoGPT language modeling, we argue that Muon's effectiveness likely hinges on structural properties, such as smoothness, that the convex Lipschitz model does not capture. These findings challenge how well this classical model explains the behavior of practical optimizers.
📝 Abstract
Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon.
Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence for Muon and for all non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the best suited to Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.
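For readers unfamiliar with the update being analyzed, the following is a minimal sketch of a Muon-style step, written under several assumptions that are not taken from the abstract. It accumulates heavy-ball momentum and then replaces the momentum matrix by its orthogonal polar factor, which is the steepest-descent direction under the spectral norm. The function names, the default `lr` and `beta` values, and the use of an exact SVD are illustrative choices; practical Muon implementations typically approximate the orthogonalization with a Newton-Schulz iteration, and the error-feedback variant studied in the paper is not shown here.

```python
import numpy as np


def orthogonalize(m):
    """Return the polar factor U @ Vt of m.

    For a full-rank matrix this sets every singular value to 1.
    Assumption: exact SVD is used here for clarity only.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_like_step(w, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon-style update on a 2-D weight matrix.

    w, grad, and momentum_buf share the same shape. The hyperparameter
    defaults are hypothetical, not values reported by the paper.
    """
    # Heavy-ball momentum accumulation of (sub)gradients.
    momentum_buf = beta * momentum_buf + grad
    # Orthogonalized direction: unit spectral norm regardless of gradient scale.
    update = orthogonalize(momentum_buf)
    # Every step therefore has spectral norm exactly lr (for full-rank momentum).
    w = w - lr * update
    return w, momentum_buf


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 3))
    buf = np.zeros_like(w)
    g = rng.standard_normal((4, 3))
    w_new, buf = muon_like_step(w, g, buf)
    # Singular values of the applied step, rescaled by the learning rate.
    print(np.linalg.svd((w - w_new) / 0.02, compute_uv=False))
```

Because the step length is set entirely by the learning rate and not by the gradient magnitude, the update does not shrink on its own near a minimizer of a nonsmooth function, which gives some intuition for why this family of methods is delicate to analyze in the convex Lipschitz setting.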