Effective Quantization of Muon Optimizer States

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high optimizer-state memory overhead of Muon (comparable to AdamW's) in large language model (LLM) pretraining, this work proposes the first 8-bit quantized Muon optimizer. Methodologically, it integrates blockwise quantization, linear/dynamic dual-mode quantization, gradient-state compression, and matrix orthogonalization. Theoretical analysis reveals a quantization-robustness mechanism superior to that of AdamW-style optimizers. Experiments on 1.6B-model pretraining and Llama 3.2 3B fine-tuning demonstrate a 74% reduction in optimizer-state memory consumption, while maintaining convergence speed on par with full-precision Muon and significantly outperforming both 8-bit and full-precision AdamW. Crucially, the dynamic quantization variant preserves training stability throughout optimization.
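The matrix orthogonalization at Muon's core can be illustrated with the classic cubic Newton–Schulz iteration. The actual optimizer uses a tuned odd-polynomial variant, so the function below (`newton_schulz_orthogonalize`, a hypothetical name) is only a sketch of the idea, not the paper's implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Normalize so all singular values lie in (0, 1]; the Frobenius norm
    # upper-bounds the spectral norm, which keeps the iteration convergent.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        # Cubic Newton-Schulz step: drives each singular value s toward 1
        # via s <- 1.5*s - 0.5*s^3, leaving the singular vectors unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

With enough steps, `X @ X.T` approaches the identity, i.e. the momentum matrix is replaced by (an approximation of) its orthogonal polar factor.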

📝 Abstract
The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2x computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization, which improves on linear quantization's stability for extreme values. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization, supporting both linear and dynamic schemes. We demonstrate that 8-bit Muon maintains stability under both, while delivering a ~74% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon closely matches the performance of Muon while outperforming AdamW and 8-bit AdamW in pre-training a 1.6B model on 4B FineWeb tokens. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. We also provide a theoretical perspective to help explain this robustness under quantization.
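The ~74% figure in the abstract is consistent with simple bit accounting. A sketch, assuming an fp32 momentum baseline, a block size of 256, and one fp32 absmax scale per block (constants the abstract does not specify):

```python
block_size = 256                           # assumed block size (not stated in the abstract)
baseline_bits = 32.0                       # full-precision (fp32) state per element
quant_bits = 8 + 32 / block_size           # int8 value + shared fp32 scale per block
reduction = 1 - quant_bits / baseline_bits
print(f"optimizer-state memory reduction: {reduction:.1%}")  # -> 74.6%
```

The per-block scale is why the reduction is slightly below the naive 75% of going from 32 to 8 bits.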
Problem

Research questions and friction points this paper is trying to address.

Develops 8-bit quantization for Muon optimizer states
Reduces memory footprint while maintaining training stability
Enables efficient LLM pretraining with blockwise quantization schemes
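The blockwise quantization referenced above can be sketched with a minimal linear (absmax) scheme. The helper names below are hypothetical, not the paper's code: each block of the flattened state gets its own scale, so one extreme value only degrades the resolution of its own block.

```python
import numpy as np

def quantize_blockwise(x, block_size=256):
    """Linear (absmax) blockwise int8 quantization sketch."""
    flat = x.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size          # pad so length divides evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scales * 127), -127, 127).astype(np.int8)
    return q, scales, x.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    flat = ((q.astype(np.float32) / 127) * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)
```

Round-tripping a state tensor through these two functions introduces only a small per-block rounding error, at roughly a quarter of the memory.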
Innovation

Methods, ideas, or system contributions that make the work stand out.

8-bit Muon optimizer using blockwise quantization
Supports both linear and dynamic quantization schemes
Maintains stability while reducing memory footprint by 74%
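The bullets above distinguish linear from dynamic quantization. As a simplified stand-in for the dynamic scheme (the real one follows bitsandbytes-style dynamic tree quantization), the cube-spaced codebook below illustrates the key property: fine resolution near zero and coarse resolution near the block absmax, which is what makes extreme values less damaging. All names here are hypothetical.

```python
import numpy as np

# Nonlinear 256-entry codebook: cube spacing concentrates levels near zero.
levels = np.linspace(-1.0, 1.0, 256)
codebook = np.sign(levels) * np.abs(levels) ** 3

def quantize_dynamic(block):
    scale = float(np.abs(block).max())
    if scale == 0.0:
        scale = 1.0
    normed = block / scale
    # Nearest-codebook-entry lookup (brute force for clarity).
    idx = np.abs(normed[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_dynamic(idx, scale):
    return codebook[idx] * scale
```

On a linear int8 grid, a value of 0.001 in a block whose absmax is 1.0 rounds straight to zero (the grid step is 1/127 ≈ 0.0079); the nonlinear codebook reconstructs it to within roughly 1e-4.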
Authors
Aman Gupta (Nubank)
Rafael Celente (Nubank)
Abhishek Shivanna (Nubank)
D. T. Braithwaite (Nubank)
Gregory Dexter (LinkedIn Corporation)
Shao Tang (LinkedIn)
Hiroto Udagawa (Nubank)
Daniel Silva (Nubank)
Rohan Ramanath (Nubank)
S. Sathiya Keerthi (Nubank)

Tags: LLM Post-Training · Agent · Optimization