When do spectral gradient updates help in deep learning?

πŸ“… 2025-12-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work investigates the conditions under which spectral gradient methods (e.g., Muon) outperform Euclidean gradient descent. Method: a layerwise criterion that, for each parameter block, compares the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations; a spectral update is predicted to decrease the loss more than a Euclidean step when the former exceeds the latter. Contribution/Results: the paper proves that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks, and shows in spiked random feature models that the gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. Experiments on random feature models and NanoGPT-scale Transformers confirm that activations keep low stable rank and gradients maintain large nuclear-to-Frobenius ratios throughout training.

πŸ“ Abstract
Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
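The layerwise condition described in the abstract can be sketched numerically. The snippet below is our own illustration of that reading, not the authors' code (function names are hypothetical): it compares the squared nuclear-to-Frobenius ratio of a gradient block `G` with the stable rank of the incoming activation matrix `A`.

```python
import numpy as np

def sq_nuc_to_frob_ratio(G):
    """(||G||_* / ||G||_F)^2: large when G's singular values are spread evenly."""
    s = np.linalg.svd(G, compute_uv=False)
    return (s.sum() / np.linalg.norm(s)) ** 2

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2: a soft proxy for the rank of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return (np.linalg.norm(s) / s[0]) ** 2

def spectral_update_favored(G, A):
    """Our reading of the layerwise condition: a spectral update is predicted
    to beat a Euclidean step when the gradient's squared nuclear-to-Frobenius
    ratio exceeds the stable rank of the incoming activations."""
    return sq_nuc_to_frob_ratio(G) > stable_rank(A)

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 256))  # near-isotropic gradient: ratio of order the dimension
A = rng.standard_normal((256, 4)) @ rng.standard_normal((4, 256))  # rank-4 activations
print(spectral_update_favored(G, A))  # True: the predicted spectral advantage holds
```

In this toy regime the Gaussian gradient's squared ratio is of order the dimension, while the rank-4 activations have stable rank at most 4, so the condition holds comfortably; this mirrors the paper's claim that the advantage grows with dimension under low-rank activation structure.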
Problem

Research questions and friction points this paper is trying to address.

When do spectral gradient updates outperform Euclidean gradient descent in deep learning?
How do the stable rank of activations and the gradient's nuclear-to-Frobenius ratio behave in deep networks?
Under what conditions are spectral methods effective for training deep neural networks and transformers?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A layerwise condition that predicts when a spectral update beats a Euclidean gradient step
Proof that post-activation matrices have low stable rank at Gaussian initialization
Gradient nuclear-to-Frobenius ratios that grow with data dimension, so the predicted spectral advantage scales with dimension
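As a quick numerical illustration of the low-stable-rank claim (our own sketch, not the paper's experiment; the dimensions are arbitrary): applying a ReLU to a large Gaussian matrix, used here as a stand-in for preactivations at initialization, yields a post-activation matrix whose stable rank stays small and roughly dimension-independent, because the nonlinearity injects a dominant rank-one mean component.

```python
import numpy as np

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2: a soft proxy for the rank of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return (np.linalg.norm(s) / s[0]) ** 2

rng = np.random.default_rng(0)
for n in (128, 256, 512):
    Z = rng.standard_normal((n, n))  # stand-in for Gaussian preactivations
    A = np.maximum(Z, 0.0)           # ReLU post-activations
    # The positive mean of ReLU(Z) acts as a rank-one spike dominating the
    # spectrum, so the stable rank hovers near a small constant rather than
    # growing with n.
    print(n, round(stable_rank(A), 2))
```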
πŸ”Ž Similar Papers
No similar papers found.