MARS: Unleashing the Power of Variance Reduction for Training Large Models

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Variance reduction (VR) optimization algorithms face challenges in large language model (LLM) training—including limited applicability, slow convergence, and instability. To address these issues, we propose MARS, a unified optimization framework that tightly integrates VR techniques with preconditioned gradient methods. MARS introduces a scaled stochastic recursive momentum mechanism and an adaptive learning rate strategy. It is instantiated into three variants compatible with AdamW, Lion, and Shampoo, thereby establishing the first theoretical connection between VR methods and mainstream adaptive optimizers. Empirical evaluation on GPT-2 training demonstrates that MARS achieves up to 25% faster convergence than AdamW while exhibiting enhanced training stability. The implementation is publicly available.

📝 Abstract
Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin. The implementation of MARS is available at https://github.com/AGI-Arena/MARS.
Problem

Research questions and friction points this paper is trying to address.

Improves efficiency in training large models
Enhances variance reduction in optimization
Unifies preconditioned gradient with variance reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified optimization framework MARS
Scaled stochastic recursive momentum
Preconditioned gradient updates integration
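As a rough illustration of the idea above, the sketch below combines a scaled recursive-momentum (variance-reduction) correction of the stochastic gradient with an AdamW-style preconditioned update. This is a minimal interpretation of the framework, not the authors' exact algorithm: the scaling form `gamma * beta1 / (1 - beta1)` and all hyperparameter values are assumptions for illustration; see the paper and the linked repository for the real update rules.

```python
import math

def mars_adamw_step(p, g, g_prev, m, v, t, lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, eps=1e-8, weight_decay=0.0):
    """One illustrative update mixing a scaled recursive-momentum correction
    with an AdamW-style preconditioner. Hyperparameters are placeholders."""
    # Variance-reduced gradient: correct the current gradient g with the
    # change since the previous step, scaled by gamma * beta1 / (1 - beta1)
    # (the "scaled" part of the stochastic recursive momentum; assumed form).
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)
    # Adam-style first and second moment estimates of the corrected gradient.
    m = beta1 * m + (1.0 - beta1) * c
    v = beta2 * v + (1.0 - beta2) * c * c
    # Bias correction, then preconditioned step with decoupled weight decay.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

# Scalar usage example; real training applies the same rule elementwise
# to parameter tensors.
p, m, v, g_prev = 1.0, 0.0, 0.0, 0.0
for t, g in enumerate([0.4, 0.5, 0.45], start=1):
    p, m, v = mars_adamw_step(p, g, g_prev, m, v, t)
    g_prev = g
```

Swapping the Adam-style preconditioner for a Lion- or Shampoo-style one would yield the other two instances the summary mentions.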