MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

📅 2026-05-26
🤖 AI Summary
This work addresses the tendency of the Muon optimizer to converge to sharp local minima during large language model training, which harms generalization. The proposed method, MONA, combines curvature awareness with Nesterov acceleration: it adds an acceleration term built from the exponential moving average of gradient differences and, according to the authors for the first time, pairs Nesterov-type momentum with a matrix orthogonalization framework, so that updates stay geometry-aware and retain Muon's spectral-norm regularization. The resulting algorithm improves training stability and efficiency for large-scale Mixture-of-Experts (MoE) models and outperforms both Muon and AdamW across models from 1B to 68B parameters. The fine-tuned 68B model is reported to achieve state-of-the-art performance on general capability, mathematical reasoning, and code generation benchmarks.
📝 Abstract
The Muon optimizer has recently emerged as a promising alternative to AdamW for large language model training, using matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA inserts an acceleration term, computed from the exponential moving average of gradient differences, directly into Muon's gradient processing pipeline. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance than both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves state-of-the-art performance.
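The abstract describes the mechanism only in words, so the following is a minimal sketch of how an acceleration term built from an exponential moving average of gradient differences might be fed into a Muon-style orthogonalization step. It is not the authors' algorithm. The function names, the hyperparameters (`lr`, `beta1`, `beta2`, `alpha`), and in particular the way the two terms are combined (`momentum + alpha * diff_ema`) are assumptions made for illustration. The orthogonalization uses the quintic Newton-Schulz iteration commonly used in public Muon implementations, which pushes singular values toward 1 only approximately.

```python
import numpy as np


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Coefficients follow common public Muon implementations; singular values
    end up near 1 (roughly 0.7 to 1.2), not exactly 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm bounds singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so X @ X.T is the small Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def mona_like_step(W, grad, state, lr=0.02, beta1=0.95, beta2=0.9, alpha=1.0):
    """One hypothetical update for a 2-D weight matrix W.

    ASSUMPTION: the combination rule below is illustrative only; the source
    does not state the exact update.
    """
    zeros = np.zeros_like(grad)
    prev = state.get("prev_grad")
    # Gradient difference acts as a cheap curvature signal (zero on the first step).
    diff = zeros if prev is None else grad - prev
    # Exponential moving average of gradient differences: the acceleration term.
    state["diff_ema"] = beta2 * state.get("diff_ema", zeros) + (1 - beta2) * diff
    # Standard momentum buffer, as in Muon-style optimizers.
    state["momentum"] = beta1 * state.get("momentum", zeros) + grad
    state["prev_grad"] = grad.copy()
    # Orthogonalize the accelerated direction, then take the step.
    direction = newton_schulz(state["momentum"] + alpha * state["diff_ema"])
    return W - lr * direction
```

In this sketch the acceleration term is added before orthogonalization, so the final update still passes through Newton-Schulz and keeps its roughly uniform singular values. Whether MONA places the term at this point in the pipeline is not specified in the abstract.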
Problem

Research questions and friction points this paper is trying to address.

large language model training
sharp local minima
first-order optimization
convergence
optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon optimizer
Nesterov acceleration
matrix orthogonalization
curvature-aware optimization
Mixture-of-Experts