Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the convergence properties of an AdamW-style Shampoo optimizer, aiming to unify the theoretical frameworks of one-sided and two-sided preconditioning. By measuring progress in the nuclear norm, the study establishes, for the first time, a convergence rate bound for this optimizer in non-convex stochastic optimization and reveals its intrinsic connection to the optimal rate of SGD. Leveraging tools from stochastic optimization theory, matrix analysis, and preconditioning, together with the relationship between the nuclear and Frobenius norms, the authors prove that the average gradient nuclear norm converges at a rate of $O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$. Under ideal conditions, this rate matches the optimal SGD rate of $O\left(\frac{C}{K^{1/4}}\right)$, providing new theoretical justification for adaptive second-order optimization methods.

📝 Abstract
This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$ measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. Since $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, this rate can be viewed as analogous to the optimal SGD rate $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ in the ideal case $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.
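The norm sandwich in the abstract is easy to check numerically: for any $m\times n$ matrix, the nuclear norm (sum of singular values) dominates the Frobenius norm (root of sum of squared singular values) and is in turn bounded by $\sqrt{\min(m,n)}$, hence also $\sqrt{m+n}$, times it. A minimal sketch (matrix size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
G = rng.standard_normal((m, n))  # stand-in for a gradient matrix

s = np.linalg.svd(G, compute_uv=False)
nuc = s.sum()                    # nuclear norm ||G||_*
fro = np.sqrt((s ** 2).sum())    # Frobenius norm ||G||_F

# ||G||_F <= ||G||_* <= sqrt(min(m,n)) ||G||_F <= sqrt(m+n) ||G||_F
assert fro <= nuc <= np.sqrt(min(m, n)) * fro <= np.sqrt(m + n) * fro
```

The "ideal case" in the abstract is the regime where the nuclear norm actually attains the $\Theta(\sqrt{m+n})$ multiple of the Frobenius norm, which happens when the gradient's singular values are spread evenly rather than concentrated in a few directions.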
Problem

Research questions and friction points this paper is trying to address.

convergence rate
AdamW-style Shampoo
preconditioning
optimization
neural network training
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdamW-style Shampoo
convergence rate analysis
one-sided preconditioning
two-sided preconditioning
nuclear norm
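The two-sided preconditioning listed above can be sketched with the classical Shampoo update: accumulate left and right gradient statistics and precondition with their inverse fourth roots. This is a minimal illustration of the generic scheme, not the paper's exact AdamW-style variant; the learning rate, epsilon, and `shampoo_step` name are illustrative assumptions.

```python
import numpy as np

def shampoo_step(X, G, L, R, lr=1e-2, eps=1e-6):
    """One classical two-sided Shampoo-style update for a matrix parameter X.

    L and R accumulate second-moment statistics of the gradient G;
    the step preconditions G on both sides with inverse fourth roots.
    (Illustrative sketch, not the paper's AdamW-style variant.)
    """
    L += G @ G.T                       # left statistics, shape (m, m)
    R += G.T @ G                       # right statistics, shape (n, n)

    def inv_quarter_root(M):
        # inverse fourth root of a symmetric PSD matrix via eigendecomposition
        w, V = np.linalg.eigh(M)
        return V @ np.diag((w + eps) ** -0.25) @ V.T

    X -= lr * inv_quarter_root(L) @ G @ inv_quarter_root(R)
    return X, L, R
```

One-sided preconditioning corresponds to applying only the left (or only the right) factor, which is the cheaper variant the paper's analysis unifies with the two-sided form above.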