Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the difficulty of unifying adaptive optimization across vector and matrix parameters: extending coordinate-wise adaptivity, as in Adam, to spectral optimization over matrices faces both theoretical and implementation barriers. To overcome this, the authors propose DeVA, a framework that decouples the AdaGrad update rule into two orthogonal components: a variance-adaptive term and a scale-invariant term. This decomposition yields a unified paradigm for adaptive optimization spanning vectors to matrix spectra. By integrating block-wise smoothness analysis with spectral adaptivity, DeVA fits seamlessly into modern deep learning training pipelines. Empirical results show that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP on language modeling and image classification tasks, converging faster and reducing token consumption by approximately 6.6%.
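As a rough illustration of the decoupling described above (a hypothetical rendering, not necessarily the paper's exact factorization), an AdaGrad-style preconditioned step can always be split into a variance-adaptive magnitude and a scale-invariant unit direction:

```latex
% Schematic factorization of an AdaGrad-style step (hypothetical):
% the preconditioned update splits into a variance-adaptive scalar
% magnitude and a scale-invariant (unit-norm) direction.
% Here V_t = diag(sum_{s<=t} g_s^2) is the AdaGrad accumulator.
\Delta x_t \;=\; -\eta\, V_t^{-1/2} g_t
          \;=\; -\eta\,
          \underbrace{\bigl\| V_t^{-1/2} g_t \bigr\|}_{\text{variance adaptation}}
          \cdot
          \underbrace{\frac{V_t^{-1/2} g_t}{\bigl\| V_t^{-1/2} g_t \bigr\|}}_{\text{scale-invariant direction}}
```

The second factor is unchanged under rescaling of the gradient history, which is what would let it be swapped for a matrix-spectral direction while the first factor keeps the variance adaptation.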

📝 Abstract
Adaptive methods like Adam have become the $\textit{de facto}$ standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than as long vectors. Linking these approaches is difficult because many natural generalizations are infeasible to implement, and Adam's adaptation cannot simply be transplanted to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces $\textbf{DeVA}$ ($\textbf{De}$coupled $\textbf{V}$ariance $\textbf{A}$daptation), a framework that bridges vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6\%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled-Variance-Adaptation
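The abstract's reformulation can be sketched in code. The snippet below is a minimal hypothetical illustration (not the authors' implementation, and `adagrad_step_decoupled` is an invented name): it factors a plain AdaGrad step into a variance-adaptive magnitude times a scale-invariant unit direction, then recombines them, reproducing the standard update.

```python
import numpy as np

def adagrad_step_decoupled(w, grad, v, lr=1e-2, eps=1e-8):
    """Hypothetical sketch: one AdaGrad step factored into a
    variance-adaptive magnitude and a scale-invariant direction.
    Names and structure are illustrative, not the paper's code."""
    # AdaGrad accumulator: running sum of squared gradients.
    v = v + grad**2
    # Preconditioned gradient (the usual AdaGrad update direction).
    precond = grad / (np.sqrt(v) + eps)
    # Decoupling: a scalar magnitude (variance adaptation) times a
    # unit-norm direction (invariant to rescaling the gradient).
    magnitude = np.linalg.norm(precond)
    direction = precond / (magnitude + eps)
    # Recombining the two factors reproduces the standard AdaGrad step;
    # a spectral method could instead substitute a matrix direction here.
    w = w - lr * magnitude * direction
    return w, v
```

In this view, swapping `direction` for an orthogonalized matrix update while keeping `magnitude` is one way to read the "seamless transition from Adam to adaptive spectral descent" claimed above.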
Problem

Research questions and friction points this paper is trying to address.

adaptive gradient descent
vector optimization
matrix optimization
spectral optimization
variance adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Variance Adaptation
Adaptive Gradient Descent
Matrix Spectral Optimization
Scale-Invariant Updates
Unified Vector-Matrix Optimization
Zitao Song
Purdue CS
Decision Science · Machine Learning · Reinforcement Learning
Cedar Site Bai
Purdue University, Ph.D. Candidate of Computer Science
Machine Learning · Optimization
Zhe Zhang
School of Industrial Engineering, Purdue University, West Lafayette, IN, USA
Brian Bullins
Assistant Professor, Purdue University
Optimization · Machine Learning
David F. Gleich
Department of Computer Science, Purdue University, West Lafayette, IN, USA