Muon is Scalable for LLM Training

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The scalability of the Muon optimizer to large language models (LLMs) had remained unverified. Method: We introduce two key techniques, weight decay and per-parameter update-scale calibration, enabling out-of-the-box, hyper-parameter-free training with Muon on Moonlight, a Mixture-of-Experts (MoE) model with 16B total / 3B activated parameters. We further improve system efficiency with a distributed implementation of the matrix-orthogonalization step that is memory-optimal and communication-efficient. Results: Under compute-optimal training, Muon achieves ~2× the computational efficiency of AdamW. Trained on 5.7T tokens, the Moonlight 3B/16B MoE model establishes a new FLOPs-performance Pareto frontier. All model checkpoints and the optimized Muon implementation are publicly released.
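To make the two techniques concrete, here is a minimal single-step sketch in PyTorch. It is an illustration under stated assumptions, not the paper's released implementation: the Newton-Schulz coefficients follow the original open-source Muon release, and the 0.2·√max(m, n) factor is one common way to calibrate the orthogonalized update's RMS to AdamW's typical magnitude; the paper's exact constants may differ.

```python
import math
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the open-source Muon release
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, M, G, lr=2e-2, mu=0.95, wd=0.1):
    """One Muon update with decoupled weight decay and an RMS-calibrated scale."""
    M.mul_(mu).add_(G)                     # momentum accumulation
    O = newton_schulz(M)                   # orthogonalized update direction
    scale = 0.2 * math.sqrt(max(W.shape))  # calibrate update RMS toward AdamW's (assumed constant)
    W.mul_(1 - lr * wd)                    # decoupled (AdamW-style) weight decay
    W.add_(O, alpha=-lr * scale)
    return W, M
```

Note that Muon applies only to 2-D hidden weight matrices; embeddings, output heads, and gains/biases are typically still trained with AdamW.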

📝 Abstract
Recently, the Muon optimizer, based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box in large-scale training without the need for hyper-parameter tuning. Scaling-law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained on 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
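The "memory optimal and communication efficient" distributed implementation mentioned above follows a ZeRO-1-style pattern: optimizer state (the momentum) is sharded across data-parallel ranks, but Newton-Schulz operates on whole matrices, so shards are gathered just before orthogonalization. The sketch below shows that general scheme under this assumption; function and parameter names are illustrative, not the released API, and it reuses newton_schulz from the sketch above.

```python
import torch
import torch.distributed as dist

def distributed_muon_step(W, m_shard, grad, lr=2e-2, mu=0.95, wd=0.1, scale=1.0):
    """ZeRO-1-style Muon step: sharded momentum, full-matrix orthogonalization."""
    world = dist.get_world_size()

    # 1) Reduce-scatter: each rank ends up with the averaged gradient for its
    #    shard only (assumes W's first dimension divides evenly by world size).
    g_shard = torch.empty_like(m_shard)
    dist.reduce_scatter_tensor(g_shard, grad.contiguous())
    g_shard /= world

    # 2) Update just the local momentum shard: this is the memory saving.
    m_shard.mul_(mu).add_(g_shard)

    # 3) All-gather the momentum so Newton-Schulz can see the full matrix.
    m_full = torch.empty_like(W)
    dist.all_gather_into_tensor(m_full, m_shard)
    O = newton_schulz(m_full)

    # 4) Decoupled weight decay plus the shape-calibrated orthogonal update.
    W.mul_(1 - lr * wd)
    W.add_(O, alpha=-lr * scale)
    return W, m_shard
```

Relative to plain ZeRO-1 with AdamW, the only extra communication in this scheme is the gather of the momentum before orthogonalization.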
Problem

Research questions and friction points this paper is trying to address.

Scalability of the Muon optimizer
Efficiency in large-scale training
Improving computational efficiency in LLM training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matrix-orthogonalization-based optimizer
Addition of weight decay
Per-parameter update-scale adjustment (see the math sketch after this list)
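Why a per-parameter, shape-dependent scale is the natural calibration: the orthogonalized update is (approximately) semi-orthogonal, so its entry-wise RMS is determined by its shape alone. A short derivation (standard linear algebra, not quoted from the paper):

```latex
O \in \mathbb{R}^{m \times n},\ m \le n,\quad OO^{\top} = I_m
\;\Longrightarrow\;
\mathrm{RMS}(O)
= \frac{\lVert O \rVert_F}{\sqrt{mn}}
= \sqrt{\frac{\operatorname{tr}(OO^{\top})}{mn}}
= \sqrt{\frac{m}{mn}}
= \frac{1}{\sqrt{\max(m,n)}}.
```

Scaling the update by $\sqrt{\max(m,n)}$ therefore removes the shape dependence, so a single global constant (chosen to match AdamW's typical update RMS) works for every weight matrix, which is what allows Muon to run without per-layer hyper-parameter tuning.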
👥 Authors
Jingyuan Liu (Moonshot AI)
Jianlin Su (Moonshot AI)
Xingcheng Yao (Moonshot AI)
Zhejun Jiang (Moonshot AI)
Guokun Lai (Inflection AI)
Yulun Du (Carnegie Mellon University)
Yidao Qin (Moonshot AI)
Weixin Xu (Moonshot AI)
Enzhe Lu (Moonshot AI)
Junjie Yan (Moonshot AI)
Yanru Chen (Moonshot AI)
Huabin Zheng (Moonshot AI)
Yibo Liu (Moonshot AI)
Shaowei Liu (University of Illinois Urbana-Champaign)
Bohong Yin (Moonshot AI)
Weiran He (unknown affiliation)
Han Zhu (Moonshot AI)
Yuzhi Wang (Megvii Inc.)
Jianzhou Wang (Moonshot AI)
Mengnan Dong (Moonshot AI)
Zheng Zhang (Moonshot AI)
Yongsheng Kang (Moonshot AI)
Hao Zhang (Moonshot AI)
Xinran Xu (Moonshot AI)
Yutao Zhang (Moonshot AI)
Yuxin Wu (Moonshot AI)
Xinyu Zhou (Moonshot AI)