Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

📅 2026-04-10

🤖 AI Summary

This work addresses the computation and communication overhead of the Muon optimizer in large-scale pre-training, which stems from the multiple Newton–Schulz (NS) iterations needed per step to approximate the polar decomposition. To reduce this cost, the authors propose Muon², which applies an Adam-style adaptive second-moment preconditioner to the momentum matrix before orthogonalization. The preconditioner improves the spectrum of the momentum matrix, so the polar approximation converges in fewer iterations. They further introduce Muon²-F, a memory-efficient factorized variant that preserves most of the gains with negligible memory overhead. Notably, this is presented as the first use of preconditioning to improve matrix conditioning in orthogonalization-based optimization, accompanied by a directional-alignment metric that quantifies orthogonalization quality. Experiments on GPT and LLaMA models ranging from 60M to 1.3B parameters show that Muon² consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40%.
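To see why conditioning governs the iteration count, it may help to look at the textbook cubic Newton–Schulz iteration. This is an illustrative assumption: the source does not state which polynomial is used, and Muon implementations typically use a tuned higher-order variant. The symbols $M$ (momentum matrix), $X_k$ (iterate), and $\sigma_k$ (a singular value of $X_k$) are introduced here only for this sketch.

```latex
X_0 = \frac{M}{\lVert M \rVert_F}, \qquad
X_{k+1} = \tfrac{3}{2}\, X_k - \tfrac{1}{2}\, X_k X_k^\top X_k .

% Writing X_k = U \Sigma_k V^\top, the singular vectors are unchanged and
% each singular value evolves independently:
\sigma_{k+1} = \tfrac{3}{2}\, \sigma_k - \tfrac{1}{2}\, \sigma_k^3 .

% For small \sigma_k the cubic term is negligible, so \sigma_{k+1} \approx \tfrac{3}{2}\sigma_k,
% and lifting an initial value \sigma_0 \ll 1 to order 1 takes roughly
k \approx \frac{\log(1/\sigma_0)}{\log(3/2)} \quad \text{iterations}.
```

Under this simplified model, the smallest normalized singular value sets the number of iterations, which is consistent with the summary's point that a better-conditioned momentum matrix needs fewer NS steps.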

📝 Abstract

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton--Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum Muon$^2$ substantially improves, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon$^2$ consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40\%. We further introduce Muon$^2$-F, a memory-efficient factorized variant that preserves most of the gains of Muon$^2$ with negligible memory overhead.
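The idea of preconditioning before orthogonalization can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the abstract does not give the exact update rule, so the element-wise scaling in `precondition`, the EMA coefficients, the textbook cubic NS iteration, the `orthogonality_error` measure, and the toy gradient are all hypothetical choices made for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

def newton_schulz(m, steps):
    """Approximate the polar factor of m with the textbook cubic NS iteration."""
    fro = math.sqrt(sum(x * x for row in m for x in row))
    x = [[v / fro for v in row] for row in m]  # scale so all singular values are <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

def orthogonality_error(x):
    """Largest absolute entry of x @ x.T - I; zero for an orthogonal matrix."""
    g = matmul(x, transpose(x))
    n = len(g)
    return max(abs(g[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))

def ema(state, value, beta):
    """Element-wise exponential moving average."""
    return [[beta * s + (1 - beta) * v for s, v in zip(sr, vr)]
            for sr, vr in zip(state, value)]

def precondition(momentum, second_moment, eps=1e-8):
    """ASSUMED Adam-style scaling: divide each entry by the root of its second moment."""
    return [[m / (math.sqrt(v) + eps) for m, v in zip(mr, vr)]
            for mr, vr in zip(momentum, second_moment)]

# Hypothetical gradient whose second row is about 100x smaller than the first,
# which makes the resulting momentum matrix ill-conditioned.
grad = [[1.0, 0.5], [0.005, -0.01]]
grad_sq = [[g * g for g in row] for row in grad]
momentum = [[0.0, 0.0], [0.0, 0.0]]
second_moment = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(5):  # a few optimizer steps with a constant gradient
    momentum = ema(momentum, grad, beta=0.9)
    second_moment = ema(second_moment, grad_sq, beta=0.99)

NS_STEPS = 3  # same small iteration budget for both variants
raw_err = orthogonality_error(newton_schulz(momentum, NS_STEPS))
pre_err = orthogonality_error(
    newton_schulz(precondition(momentum, second_moment), NS_STEPS)
)
print(f"orthogonality error after {NS_STEPS} NS steps: "
      f"raw={raw_err:.4f}, preconditioned={pre_err:.4f}")
```

The toy gradient is deliberately constructed so the effect is easy to see: with a constant gradient, the preconditioned matrix becomes a scaled sign matrix that is already close to orthogonal, so three NS steps suffice, while the raw momentum still has one singular value far from 1. Real gradients are noisier, and element-wise scaling is not guaranteed to improve conditioning in general.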

Problem

Research questions and friction points this paper is trying to address.

Muon

optimization

Newton-Schulz iteration

preconditioning

orthogonalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive preconditioning

second-moment estimation

orthogonalization acceleration