🤖 AI Summary
The scalability of the Muon optimizer for large language models (LLMs) remains unverified. Method: We introduce two key techniques—weight decay and per-parameter update-scale calibration—enabling out-of-the-box, hyperparameter-free training with Muon. We further improve system efficiency through a distributed Muon implementation that is memory-optimal and communication-efficient (via a ZeRO-1-style AllReduce scheme). Results: Under compute-optimal training, Muon achieves ~2× the computational efficiency of AdamW. Trained on 5.7T tokens with Muon, the Moonlight Mixture-of-Experts (MoE) model (3B activated / 16B total parameters) advances the FLOPs–performance Pareto frontier. All model checkpoints and the optimized Muon implementation are publicly released.
📝 Abstract
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
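The two techniques named above can be sketched in a few lines. This is a minimal single-matrix sketch, not the released distributed implementation: the Newton-Schulz coefficients follow the public Muon reference code, and the `0.2 * sqrt(max(n, m))` factor is the RMS-matching update scale described in the paper; the decoupled weight-decay placement is an assumption mirroring AdamW.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (drive its singular values toward 1)
    # with the quintic Newton-Schulz iteration; the coefficients below
    # follow the open-source Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.1):
    # One Muon update for a weight matrix W, with the paper's two additions:
    #   (1) per-parameter update scale 0.2*sqrt(max(n, m)), which matches
    #       the update RMS of AdamW across differently shaped matrices;
    #   (2) decoupled weight decay, applied AdamW-style (an assumption here).
    momentum = beta * momentum + grad
    O = newton_schulz(momentum)
    n, m = W.shape
    scale = 0.2 * np.sqrt(max(n, m))
    W = W - lr * (scale * O + weight_decay * W)
    return W, momentum
```

Without the shape-dependent scale, wide and tall matrices would receive updates of very different magnitude, which is one reason vanilla Muon needed per-model tuning; with it, AdamW learning rates transfer directly.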