🤖 AI Summary
This paper identifies a blind spot in existing analyses of the Shampoo optimizer: its second-moment estimation scheme has so far been studied through the Frobenius norm, a perspective that hides a limitation of the design. Exploiting the natural connection between the second moment and a covariance matrix, the authors reformulate Shampoo's estimation as covariance estimation, posed as a Kullback–Leibler (KL) divergence minimization problem. Building on this insight, they propose KL-Shampoo, an adaptive estimation scheme that stabilizes Shampoo without auxiliary Adam-style corrections, thereby avoiding the extra optimizer state and memory that Adam introduces. The KL view also suggests a mechanism: it improves the conditioning of the preconditioning matrices, alleviating Shampoo's sensitivity to ill-conditioning. In preliminary pretraining experiments, KL-Shampoo converges faster and is more stable than Shampoo alone, and it even outperforms the Adam-stabilized variant SOAP while using less GPU memory.
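To make the two perspectives concrete, here is a minimal sketch in our own notation (not the paper's): the gradient second moment $S$ is approximated by a Kronecker-factored preconditioner $P = L \otimes R$, and the two views differ only in the objective used to fit $L$ and $R$. The direction of the KL divergence below is our illustrative assumption.

```latex
% Sketch (our notation): S is the second moment of the gradient G,
% approximated by a Kronecker-factored preconditioner P = L \otimes R.
\[
  S = \mathbb{E}\big[\operatorname{vec}(G)\,\operatorname{vec}(G)^{\top}\big],
  \qquad P = L \otimes R .
\]
% Frobenius-norm view used in prior analyses:
\[
  \min_{L,\,R}\; \big\lVert S - L \otimes R \big\rVert_{F}^{2}.
\]
% Covariance-estimation view: treat S and P as covariances of
% zero-mean Gaussians in d dimensions and minimize the KL divergence
% (direction chosen here for illustration):
\[
  \min_{L,\,R}\;
  \mathrm{KL}\!\left(\mathcal{N}(0,S)\,\Vert\,\mathcal{N}(0,P)\right)
  = \tfrac{1}{2}\Big(\operatorname{tr}\!\big(P^{-1}S\big)
    - \log\det\!\big(P^{-1}S\big) - d\Big).
\]
```

Unlike the Frobenius objective, the KL objective is scale-aware through the $\log\det$ term, which is where the conditioning of $P$ enters the fit.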
📝 Abstract
As an adaptive method, Shampoo employs a structured second-moment estimation scheme, and its effectiveness has attracted growing attention. Prior work has primarily analyzed this scheme through the Frobenius norm. Motivated by the natural connection between the second moment and a covariance matrix, we propose studying Shampoo's estimation as covariance estimation through the lens of Kullback–Leibler (KL) minimization. This alternative perspective reveals a previously hidden limitation, motivating improvements to Shampoo's design. Building on this insight, we develop a practical estimation scheme, termed KL-Shampoo, that eliminates Shampoo's reliance on Adam for stabilization, thereby removing the additional memory overhead introduced by Adam. Preliminary results show that KL-Shampoo improves Shampoo's performance, enabling it to stabilize without Adam and even outperform its Adam-stabilized variant, SOAP, in neural network pretraining.
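For background on what a structured (Kronecker-factored) second-moment scheme looks like in practice, below is a minimal PyTorch sketch of a classical Shampoo-style update for a single matrix parameter. This is standard Shampoo for context, not the paper's KL-Shampoo; the exponential moving average and the eigenvalue clamp `eps` are common practical choices, not details taken from this work.

```python
import torch

def matrix_power(M: torch.Tensor, p: float, eps: float = 1e-8) -> torch.Tensor:
    """Symmetric matrix power M^p via eigendecomposition, clamping eigenvalues."""
    vals, vecs = torch.linalg.eigh(M)
    vals = vals.clamp(min=eps)                  # guard against ill-conditioned factors
    return (vecs * vals.pow(p)) @ vecs.T        # V diag(vals^p) V^T

@torch.no_grad()
def shampoo_step(W, G, L, R, lr=1e-3, beta=0.95):
    """One Shampoo-style update for an m x n matrix parameter W with gradient G.

    L (m x m) and R (n x n) hold moving averages of the Kronecker factors of
    the gradient second moment; the update is L^{-1/4} G R^{-1/4}.
    """
    L.mul_(beta).add_(G @ G.T, alpha=1 - beta)  # left factor
    R.mul_(beta).add_(G.T @ G, alpha=1 - beta)  # right factor
    W.sub_(lr * (matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)))

# Illustrative usage with stand-in tensors:
m, n = 64, 32
W = torch.randn(m, n)
L, R = torch.eye(m), torch.eye(n)               # factor initialization (our choice)
G = torch.randn(m, n)                           # stand-in for a gradient
shampoo_step(W, G, L, R)
```

SOAP stabilizes Shampoo by additionally running Adam in the eigenbasis of these factors; that extra Adam state is the memory overhead KL-Shampoo dispenses with.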