Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work studies the poorly understood singular-value dynamics of LoRA fine-tuning with the Muon optimizer and reports an unexpected phenomenon: the singular values of the LoRA product grow at approximately equal rates. Analyzing spectral gradient flow (SpecGF), a continuous-time analogue of spectral gradient descent, in a simplified LoRA-style matrix factorization setting, the study proves this equal-rate growth and shows convergence to global minima from almost all initializations when the factor norms remain bounded, with global convergence under ℓ₂ regularization. Experiments in the same setting corroborate the theory, giving a theoretical basis for understanding the spectral evolution and convergence behavior of Muon-style optimizers.

📝 Abstract
Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. They often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
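To make the setting in the abstract concrete, here is a minimal NumPy sketch of a SpecGD-style iteration on a LoRA-style factorization. It is an illustration under stated assumptions, not the paper's code. It assumes the loss is 0.5 * ||A B - M||_F^2 for a fixed low-rank target M, and that "orthogonalization" means replacing each factor's gradient with the polar factor U Vᵀ from its thin SVD. The names `orthogonalize`, `specgd_step`, and `run`, and all dimensions, step sizes, and target singular values, are hypothetical choices made for this example.

```python
import numpy as np


def orthogonalize(G):
    """Return U @ Vt from the thin SVD G = U diag(s) Vt (the polar factor of G).

    Every nonzero singular value of G is replaced by 1, so the update has a
    flat spectrum regardless of the gradient's own spectrum.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def specgd_step(A, B, M, lr):
    """One SpecGD-style step on 0.5 * ||A @ B - M||_F^2.

    Assumption: each factor's gradient is orthogonalized separately,
    mirroring the per-factor orthogonalization described in the abstract.
    """
    R = A @ B - M          # residual of the product
    grad_A = R @ B.T       # gradient with respect to A
    grad_B = A.T @ R       # gradient with respect to B
    return A - lr * orthogonalize(grad_A), B - lr * orthogonalize(grad_B)


def run(d=8, r=3, steps=300, lr=0.01, seed=0):
    """Run the iteration from a small random initialization.

    Returns the target M, the final factors, and the top-r singular values
    of A @ B recorded after every step (shape: steps x r).
    """
    rng = np.random.default_rng(seed)
    # Hypothetical rank-r target with singular values 3, 2, 1.
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    M = U @ np.diag([3.0, 2.0, 1.0]) @ V.T
    A = 1e-3 * rng.standard_normal((d, r))
    B = 1e-3 * rng.standard_normal((r, d))
    history = []
    for _ in range(steps):
        A, B = specgd_step(A, B, M, lr)
        history.append(np.linalg.svd(A @ B, compute_uv=False)[:r])
    return M, A, B, np.array(history)


if __name__ == "__main__":
    M, A, B, history = run()
    # Print the singular values of A @ B every 50 steps to inspect their growth.
    for t in range(0, len(history), 50):
        print(t, np.round(history[t], 3))
```

Plotting the columns of `history` against the step index is one way to compare the growth of the individual singular values with the equal-rate behavior the abstract describes. A discrete iteration with a fixed step size is only a rough stand-in for the continuous-time flow analyzed in the paper.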
Problem

Research questions and friction points this paper is trying to address.

spectral gradient descent
LoRA
singular values
uniform growth
convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Gradient Descent
LoRA
Uniform Singular Value Growth
Global Convergence
Matrix Factorization
Authors
Changmin Kang
Kim Jaechul Graduate School of Artificial Intelligence, KAIST, Seoul, South Korea
Jihun Yun
KRAFTON, Researcher
High-dimensional Statistics, Sparse Estimation, Optimization, Machine Learning, Deep Learning
Baekrok Shin
Kim Jaechul Graduate School of Artificial Intelligence, KAIST, Seoul, South Korea
Yeseul Cho
Kim Jaechul Graduate School of Artificial Intelligence, KAIST, Seoul, South Korea
Chulhee Yun
Ewon Assistant Professor, KAIST Kim Jaechul Graduate School of AI
Optimization, Deep Learning Theory, Machine Learning Theory