🤖 AI Summary
This work addresses the lack of a clear geometric theory for gradient normalization in deep learning, particularly the absence of a unified spectral-norm-based framework for matrix or block-structured parameters. Modeling parameters as probability measures in the mean-field limit, the authors introduce a family of spectral Wasserstein distances indexed by Schatten norms on positive semidefinite matrices, unifying gradient-descent dynamics ranging from standard updates to those used in methods like Muon. They establish an equivalence between static and dynamic optimal-transport formulations, generalize the Bures formula, and prove an exact correspondence between normalized continuity equations and finite-particle matrix flows. The proposed distance is shown to be a bona fide metric equivalent to the classical Wasserstein-2 distance, admits closed-form solutions for commuting Gaussian covariances, and yields first geodesic-convexity results together with a spectral unbalanced-transport mechanism on the sphere induced by positively homogeneous mean-field models.
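For context, the classical Bures (Bures-Wasserstein) formula that the paper generalizes gives the quadratic Wasserstein distance between centered Gaussians in closed form:

```latex
W_2^2\big(\mathcal{N}(0,A),\,\mathcal{N}(0,B)\big)
  = \operatorname{tr}(A) + \operatorname{tr}(B)
    - 2\,\operatorname{tr}\!\Big(\big(A^{1/2} B A^{1/2}\big)^{1/2}\Big).
```

When A and B commute, this reduces to \(\|A^{1/2}-B^{1/2}\|_F^2\), which is the commuting-covariance regime in which the paper reports closed forms for the Schatten family.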
📝 Abstract
Gradient normalization is central to deep-learning optimization because it stabilizes training and reduces sensitivity to scale. For deep architectures, parameters are naturally grouped into matrices or blocks, so spectral normalizations are often more faithful than coordinatewise Euclidean ones; Muon is this paper's main motivating example. More broadly, we study a family of spectral normalization rules, ranging from ordinary gradient descent to Muon and intermediate Schatten-type schemes, in a mean-field regime where parameters are modeled by probability measures. We introduce a family of Spectral Wasserstein distances indexed by a norm γ on positive semidefinite matrices. The trace norm recovers the classical quadratic Wasserstein distance, the operator norm recovers the Muon geometry, and intermediate Schatten norms interpolate between them. We develop the static Kantorovich formulation, prove comparison bounds with W2, derive a max-min representation, and obtain a conditional Brenier theorem. For Gaussian marginals, the problem reduces to a constrained optimization over covariance matrices, extending the Bures formula and yielding a closed form for commuting covariances in the Schatten family. For monotone norms, including all Schatten cases, we prove the equivalence of the static and dynamic Benamou-Brenier formulations, deduce that the resulting transport cost is a genuine metric equivalent to W2 in fixed dimension, and show that the induced Gaussian covariance cost is also a metric. We then interpret the associated normalized continuity equation as a Spectral Wasserstein gradient flow, identify its exact finite-particle counterpart as a normalized matrix flow, obtain first geodesic-convexity results, and show how positively homogeneous mean-field models induce a spectral unbalanced transport on the sphere.
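The Schatten-indexed family of normalization rules the abstract describes can be illustrated concretely: the steepest-descent direction for a matrix gradient under a Schatten-p norm constraint interpolates between normalized gradient descent (p = 2) and the Muon-style orthogonalized update (p = ∞). A minimal NumPy sketch, with an illustrative function name not taken from the paper, and using an SVD rather than the Newton-Schulz iteration Muon uses in practice:

```python
import numpy as np

def schatten_steepest_descent(G, p):
    """Direction D maximizing <G, D> over the Schatten-p unit ball ||D||_{S_p} <= 1, p > 1.

    p = 2   -> G / ||G||_F   (normalized gradient descent)
    p = inf -> U V^T          (Muon-style spectral / orthogonalized update)
    Intermediate p reweights the singular values between these extremes.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    if np.isinf(p):
        d = np.ones_like(s)                # flatten all singular values to 1
    else:
        q = p / (p - 1.0)                  # Hoelder conjugate exponent
        # Optimal reweighting of singular values (von Neumann trace inequality)
        d = s ** (q - 1.0) / np.linalg.norm(s, ord=q) ** (q - 1.0)
    return U @ (d[:, None] * Vt)           # U diag(d) V^T
```

For a full-rank square gradient, the p = ∞ direction is an orthogonal matrix, which is exactly the "flattened spectrum" geometry the abstract attributes to the operator norm; the trace-norm end of the family corresponds to the classical Wasserstein geometry.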