🤖 AI Summary
To address the high optimizer-state memory overhead of Muon, which is comparable to AdamW's, in large language model (LLM) pretraining, this work proposes the first 8-bit quantized Muon optimizer. Methodologically, it combines blockwise quantization of the optimizer's accumulated gradient state, supporting both linear and dynamic quantization modes, with Muon's matrix orthogonalization. Theoretical analysis suggests that Muon is inherently more robust to quantization error than AdamW-style optimizers. Experiments on 1.6B-model pretraining and Llama 3.2 3B fine-tuning demonstrate a 74% reduction in optimizer-state memory, while maintaining convergence speed on par with full-precision Muon and significantly outperforming both 8-bit AdamW and full-precision AdamW. Crucially, the dynamic quantization variant preserves training stability throughout optimization.
📝 Abstract
The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2x computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization, which improves on linear quantization for extreme values. In this paper, we introduce an 8-bit Muon optimizer using blockwise quantization that supports both linear and dynamic schemes. We demonstrate that 8-bit Muon remains stable under both, while delivering a $\sim$74% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon closely matches the performance of Muon while outperforming AdamW and 8-bit AdamW in pretraining a 1.6B model on 4B FineWeb tokens. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. Finally, we provide a theoretical perspective that helps explain this robustness under quantization.
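To make the memory argument concrete, here is a minimal sketch of blockwise linear 8-bit quantization of an optimizer-state tensor, in the spirit of 8-bit AdamW. This is an illustration, not the paper's implementation: the block size, the absmax scaling rule, and the function names are assumptions, and the dynamic (non-uniform codebook) scheme is not shown. Each block of values stores one float32 scale plus int8 codes, so a float32 state tensor shrinks by roughly 4x, consistent with the reported $\sim$74% reduction.

```python
import numpy as np

def blockwise_quantize(x, block_size=256):
    """Blockwise linear 8-bit quantization (illustrative sketch).

    Flattens x, pads it to a multiple of block_size, and quantizes each
    block independently with its own absmax scale. Per-block scaling is
    what keeps a single extreme value from destroying the precision of
    the rest of the tensor.
    """
    flat = x.ravel().astype(np.float32)
    pad = (-flat.size) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # per-block absmax
    scales[scales == 0] = 1.0                            # avoid divide-by-zero
    codes = np.round(blocks / scales * 127.0).astype(np.int8)
    return codes, scales

def blockwise_dequantize(codes, scales, shape):
    """Reconstruct a float32 tensor of the given shape from block codes."""
    flat = codes.astype(np.float32) / 127.0 * scales
    return flat.ravel()[: int(np.prod(shape))].reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = rng.standard_normal((1024, 512)).astype(np.float32)
    codes, scales = blockwise_quantize(state)
    recon = blockwise_dequantize(codes, scales, state.shape)
    # Storage: 1 byte/value + 4 bytes per 256-value block vs. 4 bytes/value.
    quantized_bytes = codes.nbytes + scales.astype(np.float32).nbytes
    print(f"memory ratio: {quantized_bytes / state.nbytes:.3f}")
    print(f"max abs error: {np.max(np.abs(state - recon)):.4f}")
```

In a stateful optimizer like Muon, the state would be dequantized, updated (and, in Muon's case, orthogonalized) in full precision each step, then re-quantized for storage; dynamic quantization replaces the uniform int8 grid here with a non-uniform codebook that allocates more levels near zero.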