MARS-M: When Variance Reduction Meets Matrices

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the suboptimal optimization efficiency and convergence instability encountered in training large-scale neural networks—including large language models. We propose MARS-M, the first optimizer that unifies variance reduction (via the MARS framework) with matrix-adaptive preconditioning (inspired by the Muon architecture). Theoretically, we establish an improved convergence rate of Õ(T⁻¹⁄³) for non-convex objectives, surpassing Muon’s Õ(T⁻¹⁄⁴). Empirically, MARS-M consistently reduces training loss across language modeling and computer vision tasks, and delivers significant and consistent gains on multiple downstream benchmarks. The implementation is publicly available.

📝 Abstract
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/MARS_M.
Problem

Research questions and friction points this paper is trying to address.

Combining variance reduction with matrix-based optimization methods
Improving convergence rate for neural network training optimization
Enhancing performance on language modeling and vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining variance reduction with matrix preconditioning
Achieving faster convergence rate than previous methods
Integrating MARS technique with Muon optimizer
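The combination described above can be sketched in a few lines: a MARS-style variance-reduced gradient estimate feeds a momentum buffer, and a Muon-style orthogonalization of that momentum matrix produces the update. This is a minimal illustrative sketch, not the paper's implementation; the hyperparameter names and values (`lr`, `beta`, `gamma`) and the classic (1.5, -0.5) Newton-Schulz coefficients are assumptions — Muon itself uses a tuned polynomial.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (map it toward U V^T from its SVD)
    via the textbook Newton-Schulz iteration; Muon uses a tuned variant."""
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def mars_m_step(W, grad_fn, state, lr=0.02, beta=0.95, gamma=0.025):
    """One illustrative MARS-M-style update on a weight matrix W.
    grad_fn(W) returns the stochastic gradient for the current batch;
    all hyperparameter choices here are assumptions for the sketch."""
    g_curr = grad_fn(W)
    g_prev = state.get("g_prev", g_curr)
    # MARS-style variance reduction: correct the stochastic gradient
    # with a scaled difference of gradients at consecutive iterates.
    c = g_curr + gamma * (beta / (1.0 - beta)) * (g_curr - g_prev)
    # Momentum on the variance-reduced estimate.
    m = beta * state.get("m", np.zeros_like(W)) + (1.0 - beta) * c
    # Muon-style matrix preconditioning: orthogonalize the momentum.
    update = newton_schulz_orthogonalize(m)
    state["g_prev"], state["m"] = g_curr, m
    return W - lr * update
```

For intuition, running this on a simple quadratic objective (gradient of `||W||^2 / 2` is `W`) steadily shrinks the weight matrix, since the orthogonalized momentum stays aligned with the gradient direction.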