Understanding and Improving the Shampoo Optimizer via Kullback-Leibler Minimization

📅 2025-09-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper gives the first characterization of the Shampoo optimizer's second-moment estimation as a Kullback-Leibler (KL) divergence minimization problem, treating the second moment as a covariance matrix. Prior analyses of Shampoo's estimation scheme relied on the Frobenius norm; the KL perspective exposes a previously hidden limitation of that scheme. Building on this insight, the authors propose KL-Shampoo, an adaptive optimizer that explicitly minimizes KL divergence and no longer requires Adam-style stabilization, thereby eliminating Adam's additional memory overhead. Preliminary neural network pretraining results show that KL-Shampoo stabilizes Shampoo without Adam and can even outperform the Adam-stabilized variant SOAP, while reducing GPU memory consumption.

📝 Abstract
As an adaptive method, Shampoo employs a structured second-moment estimation, and its effectiveness has attracted growing attention. Prior work has primarily analyzed its estimation scheme through the Frobenius norm. Motivated by the natural connection between the second moment and a covariance matrix, we propose studying Shampoo's estimation as covariance estimation through the lens of Kullback-Leibler (KL) minimization. This alternative perspective reveals a previously hidden limitation, motivating improvements to Shampoo's design. Building on this insight, we develop a practical estimation scheme, termed KL-Shampoo, that eliminates Shampoo's reliance on Adam for stabilization, thereby removing the additional memory overhead introduced by Adam. Preliminary results show that KL-Shampoo improves Shampoo's performance, enabling it to stabilize without Adam and even outperform its Adam-stabilized variant, SOAP, in neural network pretraining.
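For context on the estimation scheme the abstract refers to, the following is a minimal sketch of the standard Shampoo preconditioner update as described in the original Shampoo paper. It is not the KL-Shampoo scheme (whose details are not given in this entry): Shampoo accumulates Kronecker-factored second-moment statistics L = Σ G Gᵀ and R = Σ Gᵀ G for a matrix-shaped gradient G, and preconditions with their -1/4 matrix powers. The damping constant and learning rate below are illustrative choices, not values from the paper.

```python
import numpy as np

def matrix_power(sym, p, eps=1e-8):
    """Power of a symmetric PSD matrix via eigendecomposition, with eps damping."""
    w, v = np.linalg.eigh(sym)
    w = np.maximum(w, 0.0) + eps  # clamp negative eigenvalues from round-off
    return (v * w**p) @ v.T

def shampoo_step(G, L, R, lr=0.1):
    """One baseline Shampoo update on a 2-D gradient G (illustrative sketch).

    L and R are the accumulated left/right second-moment statistics.
    Returns the parameter update and the new statistics.
    """
    L = L + G @ G.T                # left statistic,  shape (m, m)
    R = R + G.T @ G                # right statistic, shape (n, n)
    precond_grad = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return -lr * precond_grad, L, R

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
step, L, R = shampoo_step(G, np.zeros((4, 4)), np.zeros((3, 3)))
print(step.shape)  # (4, 3): the update matches the gradient's shape
```

Storing only the m×m and n×n factors (rather than an mn×mn preconditioner) is what makes Shampoo's statistics "structured"; the paper's contribution is to reinterpret how these statistics are estimated, not this factored layout itself.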
Problem

Research questions and friction points this paper is trying to address.

Analyzing Shampoo optimizer via KL minimization perspective
Identifying hidden limitations in Shampoo's estimation scheme
Developing improved KL-Shampoo to eliminate Adam dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL minimization for covariance estimation
Eliminates reliance on Adam stabilization
Reduces memory overhead in optimization
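To make the first bullet concrete: viewing a second-moment matrix as the covariance of a zero-mean Gaussian, "KL minimization for covariance estimation" means fitting an estimate Σ̂ by minimizing the KL divergence between the two Gaussians rather than a Frobenius-norm distance. The paper's exact objective over Kronecker-structured covariances is not reproduced in this entry; the sketch below only shows the standard closed-form KL between zero-mean Gaussians that such a scheme would minimize.

```python
import numpy as np

def gauss_kl(S1, S2):
    """KL( N(0, S1) || N(0, S2) ) for SPD covariance matrices S1, S2.

    Closed form: 0.5 * ( tr(S2^{-1} S1) - d + log det S2 - log det S1 ).
    """
    d = S1.shape[0]
    S2_inv = np.linalg.inv(S2)
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1) - d + logdet2 - logdet1)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
S = A @ A.T + 3.0 * np.eye(3)      # a random SPD "true" covariance
print(gauss_kl(S, S))               # 0.0: KL vanishes only at a perfect match
```

Unlike the Frobenius norm, this objective is asymmetric and scale-aware (it penalizes underestimating a direction's variance differently from overestimating it), which is the kind of property that can reveal limitations invisible to a Frobenius-norm analysis.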