🤖 AI Summary
This work addresses the inefficiency and instability of optimizers when training large language models, both in compute-optimal regimes and at high token-to-parameter ratios (T2P ≈ 200). Building upon the Muon optimizer, the authors propose an enhanced variant, Muon+, which adds an explicit normalization step after gradient (or momentum) orthogonalization. This modification stabilizes and improves training dynamics. Empirical evaluations across GPT and LLaMA architectures, with model sizes ranging from 60M to 1B parameters, show that Muon+ consistently outperforms the original Muon, yielding measurable improvements in both training and validation perplexity. The results highlight Muon+ as a particularly effective choice for industrial-scale pretraining scenarios characterized by high T2P ratios.
📝 Abstract
The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ consistently improves training and validation perplexity over Muon. Our code is available at https://github.com/K1seki221/MuonPlus.
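To make the mechanism concrete, below is a minimal NumPy sketch of a single Muon+-style update step. The Newton-Schulz orthogonalization follows the publicly known Muon reference recipe; the extra post-orthogonalization normalization shown here (RMS normalization of the orthogonalized update) is an assumption for illustration, since this abstract does not specify which norm Muon+ uses. Function names (`newton_schulz`, `muon_plus_step`) and hyperparameters are hypothetical, not taken from the repository.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D gradient/momentum matrix via the
    quintic Newton-Schulz iteration used by Muon (coefficients from the
    public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the "fat" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_plus_step(W, G, M, lr=0.02, beta=0.95, eps=1e-7):
    """One hypothetical Muon+ step: momentum accumulation, orthogonalization,
    then an explicit normalization of the orthogonalized update.
    The RMS normalization below is an assumed choice, not the paper's spec."""
    M = beta * M + G                   # momentum buffer update
    O = newton_schulz(M)               # orthogonalize the momentum
    O = O / (np.sqrt((O ** 2).mean()) + eps)  # Muon+'s extra normalization (assumed form)
    W = W - lr * O                     # apply the normalized update
    return W, M
```

Under this reading, the normalization fixes the scale of every update regardless of how well the Newton-Schulz iteration converged, which is one plausible explanation for the stability gains reported at high T2P ratios.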