Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the convergence properties of an AdamW-style Shampoo optimizer, aiming to unify the theoretical frameworks of one-sided and two-sided preconditioning. By measuring progress in the nuclear norm, the study establishes, for the first time, a convergence rate bound for this optimizer in non-convex stochastic optimization and reveals its intrinsic connection to the optimal rate of SGD. Leveraging tools from stochastic optimization theory, matrix analysis, and preconditioning, together with the relationship between the nuclear and Frobenius norms, the authors prove that the average gradient nuclear norm converges at a rate of $O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$. Under ideal conditions, this rate matches the optimal SGD rate of $O\left(\frac{C}{K^{1/4}}\right)$, providing new theoretical justification for adaptive second-order optimization methods.

📝 Abstract
This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$ measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. Since $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, this rate can be viewed as analogous to the optimal SGD rate $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ in the ideal case $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.
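The norm sandwich in the abstract is easy to check numerically: for any $m\times n$ matrix, the nuclear norm (sum of singular values) dominates the Frobenius norm (root of sum of squared singular values) and is in turn bounded by $\sqrt{\min(m,n)}$, hence also $\sqrt{m+n}$, times it. A minimal sketch (matrix size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
G = rng.standard_normal((m, n))  # stand-in for a gradient matrix

s = np.linalg.svd(G, compute_uv=False)
nuc = s.sum()                    # nuclear norm ||G||_*
fro = np.sqrt((s ** 2).sum())    # Frobenius norm ||G||_F

# ||G||_F <= ||G||_* <= sqrt(min(m,n)) ||G||_F <= sqrt(m+n) ||G||_F
assert fro <= nuc <= np.sqrt(min(m, n)) * fro <= np.sqrt(m + n) * fro
```

The "ideal case" in the abstract is the regime where the nuclear norm actually attains the $\Theta(\sqrt{m+n})$ multiple of the Frobenius norm, which happens when the gradient's singular values are spread evenly rather than concentrated in a few directions.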
Problem

Research questions and friction points this paper is trying to address.

convergence rate
AdamW-style Shampoo
preconditioning
optimization
neural network training
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdamW-style Shampoo
convergence rate analysis
one-sided preconditioning
two-sided preconditioning
nuclear norm
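The two-sided preconditioning listed above can be sketched with the classical Shampoo update: accumulate left and right gradient statistics and precondition with their inverse fourth roots. This is a minimal illustration of the generic scheme, not the paper's exact AdamW-style variant; the learning rate, epsilon, and `shampoo_step` name are illustrative assumptions.

```python
import numpy as np

def shampoo_step(X, G, L, R, lr=1e-2, eps=1e-6):
    """One classical two-sided Shampoo-style update for a matrix parameter X.

    L and R accumulate second-moment statistics of the gradient G;
    the step preconditions G on both sides with inverse fourth roots.
    (Illustrative sketch, not the paper's AdamW-style variant.)
    """
    L += G @ G.T                       # left statistics, shape (m, m)
    R += G.T @ G                       # right statistics, shape (n, n)

    def inv_quarter_root(M):
        # inverse fourth root of a symmetric PSD matrix via eigendecomposition
        w, V = np.linalg.eigh(M)
        return V @ np.diag((w + eps) ** -0.25) @ V.T

    X -= lr * inv_quarter_root(L) @ G @ inv_quarter_root(R)
    return X, L, R
```

One-sided preconditioning corresponds to applying only the left (or only the right) factor, which is the cheaper variant the paper's analysis unifies with the two-sided form above.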