🤖 AI Summary
This work studies the poorly understood dynamics of singular values during LoRA fine-tuning with the Muon optimizer and reports an unexpected phenomenon: the singular values of the LoRA product grow at approximately equal rates. Using a continuous-time spectral gradient flow (SpecGF) model in a simplified LoRA-style matrix factorization setting, the study proves this equal-rate growth and shows convergence to global minima from almost all initializations when the factor norms remain bounded, with global convergence under ℓ₂ regularization. Experiments in the same setting corroborate the theory, offering a theoretical framework for understanding the spectral evolution and convergence behavior of Muon-style optimizers.
📝 Abstract
Spectral gradient descent (SpecGD) orthogonalizes matrix parameter updates and has inspired practical optimizers such as Muon. These optimizers often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In low-rank adaptation (LoRA), where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon when fine-tuning LLMs with Muon: the singular values of the LoRA product grow near-uniformly across the spectrum, even though orthogonalization is applied to the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, in sharp contrast to the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
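The setting described in the abstract can be illustrated with a minimal NumPy sketch of discrete-time SpecGD on a LoRA-style factorization. It is not the authors' code. It assumes a squared-error loss $\tfrac{1}{2}\|AB - M\|_F^2$ with a hypothetical rank-3 target `M`, uses an exact SVD-based polar factor rather than the Newton-Schulz iteration that Muon uses in practice, and omits momentum. All names (`orthogonalize`, `specgd_lora`, `lr`, `init_scale`) and the toy values are illustrative choices, not taken from the paper.

```python
import numpy as np


def orthogonalize(G):
    """Return the polar factor U @ Vt of G (every singular value replaced by 1)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def specgd_lora(M, rank, lr=1e-3, steps=2000, init_scale=1e-3, seed=0):
    """Run SpecGD on 0.5 * ||A @ B - M||_F^2, orthogonalizing each factor's gradient separately.

    Assumed toy setup: small random initialization, constant step size, no momentum.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = M.shape
    A = init_scale * rng.standard_normal((d_out, rank))
    B = init_scale * rng.standard_normal((rank, d_in))
    history = []
    for _ in range(steps):
        R = A @ B - M        # residual of the LoRA-style product
        grad_A = R @ B.T     # gradient of the loss with respect to A
        grad_B = A.T @ R     # gradient of the loss with respect to B
        # Both gradients are computed before either factor is updated.
        A = A - lr * orthogonalize(grad_A)
        B = B - lr * orthogonalize(grad_B)
        # Record the top-`rank` singular values of the product after this step.
        history.append(np.linalg.svd(A @ B, compute_uv=False)[:rank])
    return A, B, np.array(history)


# Hypothetical rank-3 target with singular values 3, 2, 1.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))
V, _ = np.linalg.qr(rng.standard_normal((8, 8)))
target_sv = np.array([3.0, 2.0, 1.0])
M = U[:, :3] @ np.diag(target_sv) @ V[:, :3].T

A, B, history = specgd_lora(M, rank=3)
```

Each row of `history` holds the singular values of `A @ B` after one step, so plotting its columns against the step index is one way to compare their growth rates with the target values in `target_sv`.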