MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

📅 2026-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the trade-off between computational cost and scale mismatch in existing orthogonalized-update optimizers for matrix parameters. The authors propose MuonEq, a lightweight pre-conditioning approach that, for the first time, applies row/column norm equilibration as a zeroth-order whitening proxy in the pre-processing stage, removing marginal scale bias while preserving the theoretical convergence guarantees of Muon-type methods. Requiring only O(m+n) auxiliary state, MuonEq supports three strategies—row normalization (R), column normalization (C), and combined row-column normalization (RC)—followed by a limited-step Newton–Schulz orthogonalization. In LLaMA2 pretraining on the C4 dataset, the R variant consistently outperforms the original Muon optimizer on both 130M and 350M models, achieving faster convergence and lower validation perplexity.
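As a rough illustration of the equilibration step described above, the sketch below rescales the rows and/or columns of the momentum matrix by their squared-norm statistics before orthogonalization, keeping only O(m+n) per-row/per-column accumulators. The function name `equilibrate`, the RMS form of the scaling, and the `eps` damping term are assumptions made for illustration, not the paper's exact formulation.

```python
import torch

def equilibrate(momentum: torch.Tensor, mode: str = "R", eps: float = 1e-8) -> torch.Tensor:
    """Rebalance an m x n momentum matrix before finite-step orthogonalization.

    Hypothetical sketch: divides rows and/or columns by their RMS norms so that
    no single row or column dominates the spectrum fed to Newton-Schulz. Only
    the per-row and per-column norms are stored, i.e. O(m + n) auxiliary state.
    """
    out = momentum
    if mode in ("R", "RC"):
        # Per-row RMS norm, shape (m, 1): O(m) statistics.
        row_rms = out.pow(2).mean(dim=1, keepdim=True).sqrt()
        out = out / (row_rms + eps)
    if mode in ("C", "RC"):
        # Per-column RMS norm, shape (1, n): O(n) statistics.
        col_rms = out.pow(2).mean(dim=0, keepdim=True).sqrt()
        out = out / (col_rms + eps)
    return out
```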
📝 Abstract
Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton–Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
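For context, the finite-step orthogonalization that the abstract refers to is typically implemented with the quintic Newton–Schulz iteration from the open-source Muon optimizer. The sketch below follows that public implementation; the specific coefficients, and the composition with the `equilibrate` helper above, are assumptions made for illustration rather than details confirmed by this paper.

```python
import torch

# Coefficients of the quintic Newton-Schulz step used in the public Muon code;
# whether MuonEq reuses these exact values is an assumption here.
NS_COEFFS = (3.4445, -4.7750, 2.0315)

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G by pushing its singular values toward 1."""
    a, b, c = NS_COEFFS
    X = G / (G.norm() + 1e-7)   # Frobenius scaling keeps the spectral norm <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                 # iterate on the wide orientation for cheaper matmuls
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Illustrative pre-conditioned update for the default R variant:
# update = newton_schulz(equilibrate(momentum, mode="R"))
```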
Problem

Research questions and friction points this paper is trying to address.

orthogonalized-update optimizers
matrix-valued parameters
preconditioning
equilibration
training convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

pre-orthogonalization equilibration
lightweight normalization
matrix-valued optimization
Newton-Schulz orthogonalization
scale mismatch correction
Da Chang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Qiankun Shi
Pengcheng Laboratory
Lvgang Zhang
Sun Yat-sen University
Yu Li
George Washington University
Ruijie Zhang
University of Chinese Academy of Sciences
Yao Lu
Pengcheng Laboratory
Yongxiang Liu
Professor, National University of Defense Technology
Remote Sensing · Synthetic Aperture Radar · Radar · Image Processing · Pattern Recognition
Ganzhao Yuan
Shenzhen University of Advanced Technology (SUAT), China
Nonlinear Optimization · Machine Learning