MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high sensitivity of the Muon optimizer to low-bit quantization error. Because Muon's orthogonalization keeps only directional information and discards singular value magnitudes, small perturbations in singular vector directions are amplified in the update and can destabilize training. To address this, the authors propose MuonQ, a framework that enables stable 4-bit quantization of Muon's optimizer states. MuonQ combines three techniques: pre-quantization normalization, which keeps the per-step quantization error at a uniform magnitude so that the accumulated error does not develop a preferred direction; a power iteration-based decomposition that quantizes the dominant singular components separately, so that errors perturb singular value magnitudes rather than rotating singular vector directions; and μ-law companding quantization, which allocates more resolution to densely packed momentum values. In pre-training experiments on GPT-style and LLaMA-style models, 4-bit MuonQ reduces optimizer state memory by up to 7.3× while closely matching full-precision Muon in both training loss and downstream task accuracy.
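To make the companding step concrete, here is a minimal sketch of μ-law quantization next to plain linear quantization at 4 bits. It is an illustration only, not the MuonQ implementation: the value `MU = 255`, the 15-level symmetric grid, the per-list max scaling, and all function names are assumptions chosen for the example.

```python
import math

MU = 255.0  # assumed value (classic telephony setting), not taken from the paper


def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1], stretching the region near zero."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(v, bits=4):
    """Round v in [-1, 1] onto a symmetric grid (levels -7..7 for 4 bits)."""
    half = (2 ** bits - 1) // 2
    return round(v * half) / half


def quantize_linear(values, bits=4):
    """Baseline: scale by the largest magnitude, then round uniformly."""
    scale = max(abs(v) for v in values) or 1.0
    return [quantize_uniform(v / scale, bits) * scale for v in values]


def quantize_mu_law(values, bits=4, mu=MU):
    """Compress, round uniformly in the compressed domain, then expand."""
    scale = max(abs(v) for v in values) or 1.0
    return [
        mu_expand(quantize_uniform(mu_compress(v / scale, mu), bits), mu) * scale
        for v in values
    ]


if __name__ == "__main__":
    momentum = [0.01, 0.03, 0.1, -0.05, 1.0]  # small dense values plus one outlier
    print("linear:", quantize_linear(momentum))
    print("mu-law:", quantize_mu_law(momentum))
```

With one large outlier setting the scale, the linear grid collapses several of the small entries onto the same level, while the companded grid keeps them distinguishable. That is the "dense-region distinguishability" trade-off the summary refers to, shown here on toy data.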

📝 Abstract

The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt $\mu$-law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon's optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to $7.3\times$. Our code is available at https://github.com/YupengSu/MuonQ.
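The structural decomposition rests on power iteration, which can be sketched as follows: estimate the dominant singular triplet $(\sigma, u, v)$ of a matrix, then split the matrix into a rank-1 part $\sigma u v^\top$ and a residual. This is a plain-Python illustration under stated assumptions, not the authors' code: the function names, the fixed iteration count, the all-ones start vector, and the choice to extract only one component are hypothetical, and how MuonQ quantizes each part is not shown here.

```python
import math


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def normalize(x):
    """Return (unit vector, original norm). Assumes x is nonzero."""
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x], n


def top_singular_triplet(A, iters=100):
    """Power iteration for the dominant singular value and vectors of A.

    Assumes A is nonzero and the all-ones start vector is not orthogonal
    to the dominant right singular vector.
    """
    At = transpose(A)
    v, _ = normalize([1.0] * len(A[0]))
    sigma = 0.0
    for _ in range(iters):
        u, _ = normalize(matvec(A, v))       # left direction
        v, sigma = normalize(matvec(At, u))  # right direction; norm tends to sigma
    return sigma, u, v


def split_dominant(A, iters=100):
    """Split A into a rank-1 dominant part (sigma, u, v) and a residual."""
    sigma, u, v = top_singular_triplet(A, iters)
    residual = [
        [A[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
        for i in range(len(u))
    ]
    return (sigma, u, v), residual


if __name__ == "__main__":
    M = [[3.0, 0.0], [0.0, 1.0]]
    (sigma, u, v), R = split_dominant(M)
    print("sigma:", sigma)
    print("u:", u, "v:", v)
    print("residual:", R)
```

Keeping the unit vectors `u` and `v` apart from the scalar `sigma` is what makes it possible, in principle, to treat directions and magnitudes differently when quantizing. The abstract states the goal (errors should perturb magnitudes, not rotate directions) but the exact storage format is an open detail in this sketch.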

Problem

Research questions and friction points this paper is trying to address.

Muon optimizer

low-bit quantization

quantization error

directional fidelity

optimizer state

Innovation

Methods, ideas, or system contributions that make the work stand out.

directional fidelity optimization

low-bit quantization

gradient orthogonalization