AI Summary
Non-Euclidean LMO-based optimizers (e.g., Muon, Scion) outperform Adam-type methods in large language model (LLM) training, yet their vanilla momentum leaves room for provable improvement. Method: This work systematically integrates momentum variance reduction (MVR) into the generic Gluon optimization framework for the first time, establishing theoretical guarantees under both non-convex and star-convex settings. To accommodate the layer-wise structure of neural networks, a generalized smoothness assumption is introduced. Contribution/Results: We rigorously prove that MVR improves the non-convex convergence rate of Non-Euclidean LMO optimizers from $\mathcal{O}(K^{-1/4})$ to $\mathcal{O}(K^{-1/3})$, the first such rate improvement for this class of methods. Empirical evaluation on LLM training demonstrates that our algorithm achieves significantly higher iteration efficiency than both Adam and the original Muon, validating the practical efficacy of the theoretical advance.
Abstract
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen Non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum with momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which captures Muon, Scion, and other specific Non-Euclidean LMO-based methods as special cases, and which at the same time works with a more general smoothness assumption that better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways; all of them improve the convergence rate from $\mathcal{O}\left(\frac{1}{K^{1/4}}\right)$ to $\mathcal{O}\left(\frac{1}{K^{1/3}}\right)$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.
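To make the two ingredients concrete, here is a minimal NumPy sketch of (a) an LMO over the spectral-norm ball, the oracle underlying Muon-style updates, and (b) a STORM-style MVR momentum estimator of the kind the abstract refers to. This is an illustrative sketch under standard definitions, not the authors' implementation; the function names, the step scaling, and the choice of SVD (rather than a Newton-Schulz iteration) are our own simplifications.

```python
import numpy as np

def lmo_spectral(G, radius=1.0):
    """LMO over the spectral-norm ball:
    argmin_{||X||_op <= radius} <G, X> = -radius * U @ V^T,
    where G = U diag(s) V^T is the SVD. This 'orthogonalized'
    direction is what Muon-style methods use as the layer update."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -radius * (U @ Vt)

def mvr_momentum(d_prev, g_curr, g_prev_same_sample, alpha=0.1):
    """STORM-style momentum variance reduction:
    d_t = g(x_t; xi_t) + (1 - alpha) * (d_{t-1} - g(x_{t-1}; xi_t)).
    Both stochastic gradients are evaluated on the SAME sample xi_t;
    the correction term (d_{t-1} - g(x_{t-1}; xi_t)) is what
    distinguishes MVR from vanilla heavy-ball momentum."""
    return g_curr + (1.0 - alpha) * (d_prev - g_prev_same_sample)
```

With `alpha = 1` the estimator reduces to plain SGD (`d_t = g_curr`), and feeding `d_t` into `lmo_spectral` in place of the raw momentum buffer yields the kind of MVR-plus-LMO step the paper analyzes.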