🤖 AI Summary
The scalability of the Muon optimizer for large language models (LLMs) remains unverified. Method: We introduce two key techniques—weight decay and per-parameter update-scale calibration—enabling out-of-the-box, hyperparameter-free training with Muon. We further improve system efficiency through a distributed Muon implementation that is memory-optimal and communication-efficient (via a ZeRO-1-style AllReduce scheme). Results: Under compute-optimal training, Muon achieves ~2× the computational efficiency of AdamW. Trained on 5.7T tokens with Muon, the Moonlight Mixture-of-Experts (MoE) model (3B activated / 16B total parameters) advances the FLOPs–performance Pareto frontier. All model checkpoints and the optimized Muon implementation are publicly released.
📝 Abstract
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
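The two techniques named above can be sketched in a few lines. This is a minimal single-matrix sketch, not the released distributed implementation: the Newton-Schulz coefficients follow the public Muon reference code, and the `0.2 * sqrt(max(n, m))` factor is the RMS-matching update scale described in the paper; the decoupled weight-decay placement is an assumption mirroring AdamW.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (drive its singular values toward 1)
    # with the quintic Newton-Schulz iteration; the coefficients below
    # follow the open-source Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.1):
    # One Muon update for a weight matrix W, with the paper's two additions:
    #   (1) per-parameter update scale 0.2*sqrt(max(n, m)), which matches
    #       the update RMS of AdamW across differently shaped matrices;
    #   (2) decoupled weight decay, applied AdamW-style (an assumption here).
    momentum = beta * momentum + grad
    O = newton_schulz(momentum)
    n, m = W.shape
    scale = 0.2 * np.sqrt(max(n, m))
    W = W - lr * (scale * O + weight_decay * W)
    return W, momentum
```

Without the shape-dependent scale, wide and tall matrices would receive updates of very different magnitude, which is one reason vanilla Muon needed per-model tuning; with it, AdamW learning rates transfer directly.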