MARS-M: When Variance Reduction Meets Matrices

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the suboptimal optimization efficiency and convergence instability encountered in training large-scale neural networks—including large language models. We propose MARS-M, the first optimizer that unifies variance reduction (via the MARS framework) with matrix-adaptive preconditioning (inspired by the Muon architecture). Theoretically, we establish an improved convergence rate of Õ(T⁻¹⁄³) for non-convex objectives, surpassing Muon’s Õ(T⁻¹⁄⁴). Empirically, MARS-M consistently reduces training loss across language modeling and computer vision tasks, and delivers significant and consistent gains on multiple downstream benchmarks. The implementation is publicly available.

📝 Abstract
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/MARS_M.
Problem

Research questions and friction points this paper is trying to address.

Combining variance reduction with matrix-based optimization methods
Improving convergence rate for neural network training optimization
Enhancing performance on language modeling and vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining variance reduction with matrix preconditioning
Achieving faster convergence rate than previous methods
Integrating MARS technique with Muon optimizer
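The combination described above can be sketched in a few lines: a MARS-style variance-reduced gradient estimate feeds a momentum buffer, and a Muon-style orthogonalization of that momentum matrix produces the update. This is a minimal illustrative sketch, not the paper's implementation; the hyperparameter names and values (`lr`, `beta`, `gamma`) and the classic (1.5, -0.5) Newton-Schulz coefficients are assumptions — Muon itself uses a tuned polynomial.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (map it toward U V^T from its SVD)
    via the textbook Newton-Schulz iteration; Muon uses a tuned variant."""
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def mars_m_step(W, grad_fn, state, lr=0.02, beta=0.95, gamma=0.025):
    """One illustrative MARS-M-style update on a weight matrix W.
    grad_fn(W) returns the stochastic gradient for the current batch;
    all hyperparameter choices here are assumptions for the sketch."""
    g_curr = grad_fn(W)
    g_prev = state.get("g_prev", g_curr)
    # MARS-style variance reduction: correct the stochastic gradient
    # with a scaled difference of gradients at consecutive iterates.
    c = g_curr + gamma * (beta / (1.0 - beta)) * (g_curr - g_prev)
    # Momentum on the variance-reduced estimate.
    m = beta * state.get("m", np.zeros_like(W)) + (1.0 - beta) * c
    # Muon-style matrix preconditioning: orthogonalize the momentum.
    update = newton_schulz_orthogonalize(m)
    state["g_prev"], state["m"] = g_curr, m
    return W - lr * update
```

For intuition, running this on a simple quadratic objective (gradient of `||W||^2 / 2` is `W`) steadily shrinks the weight matrix, since the orthogonalized momentum stays aligned with the gradient direction.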