🤖 AI Summary
This work addresses the high sensitivity of the Muon optimizer to low-bit quantization errors, which arises because its orthogonalization preserves only directional information, thereby amplifying minor perturbations and causing training instability. To overcome this limitation, the authors propose MuonQ, the first framework enabling stable 4-bit training with the Muon optimizer. MuonQ employs a three-stage strategy: pre-quantization normalization to keep accumulated quantization error from developing a preferred direction, power iteration-based decomposition to preserve the direction of dominant singular vectors, and μ-law companding quantization to enhance numerical resolution in dense regions. Evaluated on GPT-style and LLaMA-style models, MuonQ reduces optimizer state memory by up to 7.3× while closely matching the full-precision Muon optimizer in both training loss and downstream task accuracy.
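To make the power-iteration step concrete, the following is a minimal sketch of how a dominant singular triplet can be separated from a matrix. It is not taken from the MuonQ code: the helper names (`top_singular_triplet`, `split_dominant`), the iteration count, and the rank-1 split are all assumptions chosen for illustration, and how MuonQ stores or quantizes each part is not specified here.

```python
import math
import random


def matvec(a, x):
    """Compute a @ x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rmatvec(a, y):
    """Compute a^T @ y without forming the transpose."""
    return [sum(row[j] * yi for row, yi in zip(a, y)) for j in range(len(a[0]))]


def normalize(x):
    """Scale x to unit Euclidean norm (assumes x is nonzero)."""
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x]


def top_singular_triplet(a, num_iters=50, seed=0):
    """Estimate the dominant (u, sigma, v) of a by alternating power iteration."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in a[0]])
    for _ in range(num_iters):
        u = normalize(matvec(a, v))   # left singular direction
        v = normalize(rmatvec(a, u))  # right singular direction
    sigma = sum(ui * wi for ui, wi in zip(u, matvec(a, v)))
    return u, sigma, v


def split_dominant(a, num_iters=50):
    """Split a into a rank-1 dominant component and the remaining residual."""
    u, sigma, v = top_singular_triplet(a, num_iters)
    residual = [
        [a[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
        for i in range(len(u))
    ]
    return (u, sigma, v), residual


# Hypothetical 2x2 example with singular values sqrt(45) and sqrt(5).
(u, sigma, v), residual = split_dominant([[3.0, 0.0], [4.0, 5.0]])
```

Once the dominant component is held as separate factors, one plausible reading of the summary is that the scalar `sigma` can absorb quantization error without changing the directions `u` and `v`; the sketch stops short of that step because the summary does not describe it.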
📝 Abstract
The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt $\mu$-law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon's optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to $7.3\times$. Our code is available at https://github.com/YupengSu/MuonQ.
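As an illustration of the third technique, here is a minimal sketch of μ-law companding followed by 4-bit rounding. It uses the standard μ-law curve $F(x) = \operatorname{sign}(x)\,\ln(1+\mu|x|)/\ln(1+\mu)$; the value `mu=255`, the per-block max-abs scaling, the signed code range of -7 to 7, and the function names are assumptions for this example rather than details confirmed by the abstract. It also assumes the dense region sits near zero, which is where μ-law concentrates its resolution.

```python
import math


def mu_law_compress(x, mu=255.0):
    """Map x in [-1, 1] to [-1, 1], stretching the region near zero."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_4bit(values, mu=255.0):
    """Return (codes, scale) with signed codes in -7..7, which fit in 4 bits."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [round(mu_law_compress(v / scale, mu) * 7) for v in values]
    return codes, scale


def dequantize_4bit(codes, scale, mu=255.0):
    """Map signed 4-bit codes back to approximate real values."""
    return [mu_law_expand(c / 7, mu) * scale for c in codes]


def uniform_4bit(values):
    """Baseline: the same code range with no companding."""
    scale = max(abs(v) for v in values) or 1.0
    return [round(v / scale * 7) for v in values]


# Hypothetical block: three small, closely spaced values plus one outlier.
block = [0.005, 0.02, 0.08, 1.0]
companded_codes, block_scale = quantize_4bit(block)
uniform_codes = uniform_4bit(block)
```

On this block the uniform baseline assigns `0.005` and `0.02` the same code, while the companded codes keep all four values distinct, which is the dense-region distinguishability the abstract refers to. A real implementation would also pack two codes per byte and store the scale per block; both are omitted here.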