🤖 AI Summary
This paper identifies a blind spot in existing analyses of the Shampoo optimizer: its second-moment estimation scheme has so far been studied through the Frobenius norm, a perspective that hides a limitation of the design. Exploiting the natural connection between the second moment and a covariance matrix, the authors reformulate Shampoo's estimation as covariance estimation, posed as a Kullback–Leibler (KL) divergence minimization problem. Building on this insight, they propose KL-Shampoo, an adaptive estimation scheme that stabilizes Shampoo without auxiliary Adam-style corrections, thereby avoiding the extra optimizer state and memory that Adam introduces. The KL view also suggests a mechanism: it improves the conditioning of the preconditioning matrices, alleviating Shampoo's sensitivity to ill-conditioning. In preliminary pretraining experiments, KL-Shampoo converges faster and is more stable than Shampoo alone, and it even outperforms the Adam-stabilized variant SOAP while using less GPU memory.
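To make the two perspectives concrete, here is a minimal sketch in our own notation (not the paper's): the gradient second moment $S$ is approximated by a Kronecker-factored preconditioner $P = L \otimes R$, and the two views differ only in the objective used to fit $L$ and $R$. The direction of the KL divergence below is our illustrative assumption.

```latex
% Sketch (our notation): S is the second moment of the gradient G,
% approximated by a Kronecker-factored preconditioner P = L \otimes R.
\[
  S = \mathbb{E}\big[\operatorname{vec}(G)\,\operatorname{vec}(G)^{\top}\big],
  \qquad P = L \otimes R .
\]
% Frobenius-norm view used in prior analyses:
\[
  \min_{L,\,R}\; \big\lVert S - L \otimes R \big\rVert_{F}^{2}.
\]
% Covariance-estimation view: treat S and P as covariances of
% zero-mean Gaussians in d dimensions and minimize the KL divergence
% (direction chosen here for illustration):
\[
  \min_{L,\,R}\;
  \mathrm{KL}\!\left(\mathcal{N}(0,S)\,\Vert\,\mathcal{N}(0,P)\right)
  = \tfrac{1}{2}\Big(\operatorname{tr}\!\big(P^{-1}S\big)
    - \log\det\!\big(P^{-1}S\big) - d\Big).
\]
```

Unlike the Frobenius objective, the KL objective is scale-aware through the $\log\det$ term, which is where the conditioning of $P$ enters the fit.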
📝 Abstract
As an adaptive method, Shampoo employs a structured second-moment estimation scheme, and its effectiveness has attracted growing attention. Prior work has primarily analyzed this scheme through the Frobenius norm. Motivated by the natural connection between the second moment and a covariance matrix, we propose studying Shampoo's estimation as covariance estimation through the lens of Kullback–Leibler (KL) minimization. This alternative perspective reveals a previously hidden limitation, motivating improvements to Shampoo's design. Building on this insight, we develop a practical estimation scheme, termed KL-Shampoo, that eliminates Shampoo's reliance on Adam for stabilization, thereby removing the additional memory overhead introduced by Adam. Preliminary results show that KL-Shampoo improves Shampoo's performance, enabling it to stabilize without Adam and even outperform its Adam-stabilized variant, SOAP, in neural network pretraining.
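For background on what a structured (Kronecker-factored) second-moment scheme looks like in practice, below is a minimal PyTorch sketch of a classical Shampoo-style update for a single matrix parameter. This is standard Shampoo for context, not the paper's KL-Shampoo; the exponential moving average and the eigenvalue clamp `eps` are common practical choices, not details taken from this work.

```python
import torch

def matrix_power(M: torch.Tensor, p: float, eps: float = 1e-8) -> torch.Tensor:
    """Symmetric matrix power M^p via eigendecomposition, clamping eigenvalues."""
    vals, vecs = torch.linalg.eigh(M)
    vals = vals.clamp(min=eps)                  # guard against ill-conditioned factors
    return (vecs * vals.pow(p)) @ vecs.T        # V diag(vals^p) V^T

@torch.no_grad()
def shampoo_step(W, G, L, R, lr=1e-3, beta=0.95):
    """One Shampoo-style update for an m x n matrix parameter W with gradient G.

    L (m x m) and R (n x n) hold moving averages of the Kronecker factors of
    the gradient second moment; the update is L^{-1/4} G R^{-1/4}.
    """
    L.mul_(beta).add_(G @ G.T, alpha=1 - beta)  # left factor
    R.mul_(beta).add_(G.T @ G, alpha=1 - beta)  # right factor
    W.sub_(lr * (matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)))

# Illustrative usage with stand-in tensors:
m, n = 64, 32
W = torch.randn(m, n)
L, R = torch.eye(m), torch.eye(n)               # factor initialization (our choice)
G = torch.randn(m, n)                           # stand-in for a gradient
shampoo_step(W, G, L, R)
```

SOAP stabilizes Shampoo by additionally running Adam in the eigenbasis of these factors; that extra Adam state is the memory overhead KL-Shampoo dispenses with.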