🤖 AI Summary
To address the high optimizer-state memory overhead of Muon, which is comparable to AdamW's, in large language model (LLM) pretraining, this work proposes the first 8-bit quantized Muon optimizer. Methodologically, it combines blockwise quantization of the optimizer's accumulated gradient state, supporting both linear and dynamic quantization modes, with Muon's matrix orthogonalization. Theoretical analysis suggests that Muon is inherently more robust to quantization error than AdamW-style optimizers. Experiments on 1.6B-model pretraining and Llama 3.2 3B fine-tuning demonstrate a 74% reduction in optimizer-state memory, while maintaining convergence speed on par with full-precision Muon and significantly outperforming both 8-bit AdamW and full-precision AdamW. Crucially, the dynamic quantization variant preserves training stability throughout optimization.
📝 Abstract
The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2x computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization, which improves on linear quantization for extreme values. In this paper, we introduce an 8-bit Muon optimizer using blockwise quantization that supports both linear and dynamic schemes. We demonstrate that 8-bit Muon remains stable under both, while delivering a $\sim$74% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon closely matches the performance of Muon while outperforming AdamW and 8-bit AdamW in pretraining a 1.6B model on 4B FineWeb tokens. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. Finally, we provide a theoretical perspective that helps explain this robustness under quantization.
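To make the memory argument concrete, here is a minimal sketch of blockwise linear 8-bit quantization of an optimizer-state tensor, in the spirit of 8-bit AdamW. This is an illustration, not the paper's implementation: the block size, the absmax scaling rule, and the function names are assumptions, and the dynamic (non-uniform codebook) scheme is not shown. Each block of values stores one float32 scale plus int8 codes, so a float32 state tensor shrinks by roughly 4x, consistent with the reported $\sim$74% reduction.

```python
import numpy as np

def blockwise_quantize(x, block_size=256):
    """Blockwise linear 8-bit quantization (illustrative sketch).

    Flattens x, pads it to a multiple of block_size, and quantizes each
    block independently with its own absmax scale. Per-block scaling is
    what keeps a single extreme value from destroying the precision of
    the rest of the tensor.
    """
    flat = x.ravel().astype(np.float32)
    pad = (-flat.size) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # per-block absmax
    scales[scales == 0] = 1.0                            # avoid divide-by-zero
    codes = np.round(blocks / scales * 127.0).astype(np.int8)
    return codes, scales

def blockwise_dequantize(codes, scales, shape):
    """Reconstruct a float32 tensor of the given shape from block codes."""
    flat = codes.astype(np.float32) / 127.0 * scales
    return flat.ravel()[: int(np.prod(shape))].reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = rng.standard_normal((1024, 512)).astype(np.float32)
    codes, scales = blockwise_quantize(state)
    recon = blockwise_dequantize(codes, scales, state.shape)
    # Storage: 1 byte/value + 4 bytes per 256-value block vs. 4 bytes/value.
    quantized_bytes = codes.nbytes + scales.astype(np.float32).nbytes
    print(f"memory ratio: {quantized_bytes / state.nbytes:.3f}")
    print(f"max abs error: {np.max(np.abs(state - recon)):.4f}")
```

In a stateful optimizer like Muon, the state would be dequantized, updated (and, in Muon's case, orthogonalized) in full precision each step, then re-quantized for storage; dynamic quantization replaces the uniform int8 grid here with a non-uniform codebook that allocates more levels near zero.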