🤖 AI Summary
Existing theoretical frameworks fail to explain why non-Euclidean SGD variants such as SignSGD, Lion, and Muon empirically outperform standard SGD in training deep neural networks.
Method: We develop a unified convergence analysis framework built on structured smoothness assumptions and non-Euclidean geometric modeling. The framework integrates extrapolation, momentum-based variance reduction, and explicit exploitation of the sparsity and low-rank structure of upper bounds on the Hessian and gradient noise.
Contribution/Results: We provide the first rigorous proof that such methods achieve faster convergence rates than standard SGD under realistic structural assumptions. Our guarantees match the state-of-the-art bounds established for adaptive optimizers such as AdaGrad and Shampoo. This work delivers the first unified theoretical foundation for a broad class of high-performance optimizers, bridging the long-standing gap between empirical success and theoretical understanding in deep learning optimization.
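To make "non-Euclidean SGD" concrete, here is a minimal sketch of a SignSGD-style step with momentum, one of the method families named above. This is an illustrative toy implementation, not the paper's algorithm: the function name, step size, and momentum coefficient are assumptions for the example. The key point is that the update direction is the elementwise sign of a momentum buffer (a steepest-descent step in the l-infinity norm) rather than the raw Euclidean gradient.

```python
import numpy as np

def signsgd_momentum_step(w, grad, m, lr=0.05, beta=0.9):
    """One SignSGD-with-momentum step (hypothetical sketch, not the paper's method).

    The momentum buffer m averages recent gradients; the parameter update
    moves along sign(m), discarding gradient magnitudes, which is a
    non-Euclidean (l-infinity geometry) step.
    """
    m = beta * m + (1 - beta) * grad   # momentum: exponential gradient averaging
    w = w - lr * np.sign(m)            # non-Euclidean step: only signs are used
    return w, m

# Usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([2.0, -3.0])
m = np.zeros_like(w)
for _ in range(100):
    w, m = signsgd_momentum_step(w, w.copy(), m)
# w ends up oscillating in a small band around the minimizer 0.
```

Because the step length is fixed at `lr` in every coordinate, the iterates do not converge exactly but settle into a band of width proportional to `lr` around the optimum; this is the behavior that the structured-noise analyses summarized above aim to characterize precisely.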
📝 Abstract
Recently, several instances of non-Euclidean SGD, including SignSGD, Lion, and Muon, have attracted significant interest from the optimization community due to their practical success in training deep neural networks. Consequently, a number of works have attempted to explain this success by developing theoretical convergence analyses. Unfortunately, these results cannot properly justify the superior performance of these methods, as they do not improve upon the convergence rate of vanilla Euclidean SGD. We resolve this important open problem by developing a new unified convergence analysis under structured smoothness and gradient noise assumptions. In particular, our results indicate that non-Euclidean SGD (i) can exploit the sparsity or low-rank structure of the upper bounds on the Hessian and gradient noise, (ii) can provably benefit from popular algorithmic tools such as extrapolation and momentum variance reduction, and (iii) can match the state-of-the-art convergence rates of adaptive and more complex optimization algorithms such as AdaGrad and Shampoo.
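The unifying idea behind the methods listed in the abstract is that each corresponds to steepest descent in a different norm: the update direction solves a linear minimization over that norm's unit ball. A minimal sketch, assuming two common choices (the function names here are illustrative, not from the paper): the l-infinity norm yields an elementwise sign direction (SignSGD-style), while the spectral norm over a weight matrix yields the orthogonal polar factor of the gradient (Muon-style).

```python
import numpy as np

def lmo_linf(g):
    """l-infinity geometry: the steepest-descent direction is the
    elementwise sign of the gradient (SignSGD-style)."""
    return np.sign(g)

def lmo_spectral(G):
    """Spectral-norm geometry: the direction is the orthogonal factor
    U @ V^T from the SVD of the gradient matrix (Muon-style)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def non_euclidean_sgd_step(W, grad, lmo, lr=0.1):
    """Generic non-Euclidean step: move along the norm-dependent direction."""
    return W - lr * lmo(grad)
```

Swapping `lmo` switches the geometry without changing the update rule, which is why a single convergence analysis can cover this whole family once the smoothness and noise structure are measured in the matching norm.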