🤖 AI Summary
This work challenges the prevailing consensus that structured preconditioners merely serve as computational approximations to full-matrix methods. We systematically analyze the theoretical performance of hierarchical, diagonal, and Kronecker-factored structured preconditioners in adaptive optimization. We propose a unified analytical framework integrating matrix preconditioning theory, online regret analysis, and convex optimization convergence proofs. Our key contribution is the first theoretical demonstration that simpler structured preconditioners—such as diagonal AdaGrad and one-sided Shampoo—not only achieve the tight convergence rate of full-matrix AdaGrad but can even attain strictly superior convergence under certain conditions. Empirical evaluation confirms that these structured variants outperform their full-matrix counterparts both in training speed and generalization accuracy. Our results fundamentally reshape the conceptual understanding of structured preconditioning, establishing a new paradigm for efficient adaptive optimization that bridges rigorous theoretical guarantees with practical efficacy.
📝 Abstract
We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners, covering both online regret minimization and offline convex optimization. Our analysis not only recovers matching rates for several important structured preconditioned algorithms, including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also yields an improved convergence rate for a one-sided variant of Shampoo over that of the original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal AdaGrad and AdaGrad-Norm, which use less space and compute) are often presented as computationally efficient approximations to full-matrix AdaGrad, with the goal of improving optimization performance through better approximation. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is substantially cheaper than full-matrix AdaGrad, can outperform it both theoretically and experimentally.
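To make the space/compute contrast between the preconditioner structures concrete, here is a minimal NumPy sketch of diagonal versus full-matrix AdaGrad on a toy quadratic. This is illustrative only: the matrix `A`, the step size, and the damping `delta` are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

# Toy strongly convex quadratic f(w) = 0.5 * w^T A w (A is an arbitrary SPD example).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def run_diagonal_adagrad(w0, steps=50, lr=0.5, eps=1e-8):
    # Diagonal AdaGrad: keep only per-coordinate sums of squared gradients,
    # so the preconditioner costs O(d) memory and O(d) compute per step.
    w, h = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        h += g * g
        w = w - lr * g / (np.sqrt(h) + eps)
    return w

def run_full_matrix_adagrad(w0, steps=50, lr=0.5, delta=1e-3):
    # Full-matrix AdaGrad: accumulate gradient outer products and precondition
    # with (delta*I + G)^{-1/2}, costing O(d^2) memory and, via the
    # eigendecomposition below, O(d^3) compute per step.
    d = len(w0)
    w, G = w0.copy(), np.zeros((d, d))
    for _ in range(steps):
        g = grad(w)
        G += np.outer(g, g)
        vals, vecs = np.linalg.eigh(delta * np.eye(d) + G)
        P = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T  # (delta*I + G)^{-1/2}
        w = w - lr * P @ g
    return w

w0 = np.array([1.0, 1.0])
print("diagonal:   ", loss(run_diagonal_adagrad(w0)))
print("full-matrix:", loss(run_full_matrix_adagrad(w0)))
```

Both variants drive the loss well below its initial value here; the point of the sketch is only the asymmetry in per-step cost, not a performance comparison, which is what the paper's analysis addresses.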