🤖 AI Summary
Shampoo, while effective in large-scale training, relies on heuristics such as learning-rate grafting and stale preconditioning that lack theoretical grounding, which increases hyperparameter sensitivity and implementation complexity.
Method: We propose a principled, heuristic-free Kronecker-factorization-based optimization paradigm: (i) approximate full-matrix Adam via Frobenius-norm minimization, decoupling the preconditioner's eigenvalue and eigenvector updates; (ii) correct the eigenvalues directly, eliminating the need for learning-rate grafting; and (iii) introduce a termination criterion for a warm-started QR algorithm that adaptively schedules eigenvector updates per Kronecker factor, enabling independent update frequencies.
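The eigenvalue-correction idea in (i) and (ii) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the helper name, its arguments, and the rotate/scale/rotate-back structure are assumptions based on the summary's description of tracking corrected eigenvalues in the preconditioner's eigenbasis.

```python
import numpy as np

def eigenvalue_corrected_step(G, QL, QR, D, beta2=0.999, eps=1e-8):
    """Hypothetical sketch of one eigenvalue-corrected preconditioning step.

    G  : gradient matrix (m x n)
    QL : left eigenbasis (m x m); QR : right eigenbasis (n x n)
    D  : running second-moment estimate in the rotated basis (m x n)

    Instead of using the Kronecker factors' own (possibly stale or
    mis-scaled) eigenvalues, an Adam-style second moment is tracked
    directly in the eigenbasis, playing the role that learning-rate
    grafting from Adam served in Shampoo.
    """
    Gt = QL.T @ G @ QR                    # rotate gradient into the eigenbasis
    D = beta2 * D + (1 - beta2) * Gt**2   # update corrected eigenvalues
    update = Gt / (np.sqrt(D) + eps)      # Adam-like scaling in the eigenbasis
    return QL @ update @ QR.T, D          # rotate back to parameter space
```

With identity eigenbases this reduces to an elementwise Adam-style update, which is consistent with the claim that direct eigenvalue correction subsumes grafting from Adam.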
Contribution/Results: Experiments demonstrate that our method matches Shampoo's convergence speed and stability while substantially reducing hyperparameter sensitivity, establishing a pathway toward interpretable, theoretically grounded Kronecker-factorization-based optimization at scale.
📝 Abstract
The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly can eliminate the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency, motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled path toward removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
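The adaptive eigenbasis-refresh criterion can be illustrated with a warm-started orthogonal (QR-style) iteration. This is a hedged sketch: the function name, the tolerance, and the off-diagonal-mass stopping rule are assumptions standing in for the paper's actual criterion, which is only described at a high level in the abstract.

```python
import numpy as np

def refresh_eigenbasis(C, Q_prev, tol=1e-2, max_iters=10):
    """Hypothetical warm-started orthogonal iteration with adaptive stopping.

    C      : current Kronecker-factor statistic (symmetric PSD)
    Q_prev : eigenbasis from the previous refresh (the warm start)

    The off-diagonal mass of Q.T @ C @ Q measures how far Q is from
    diagonalizing C, i.e. how stale the basis is. Iterating only until
    that mass falls below a relative tolerance lets each preconditioner
    factor spend refresh effort independently, decoupling their update
    frequencies.
    """
    Q = Q_prev
    for _ in range(max_iters):
        M = Q.T @ C @ Q
        off = M - np.diag(np.diag(M))
        if np.linalg.norm(off) <= tol * np.linalg.norm(M):
            break                     # basis is accurate enough: stop early
        Q, _ = np.linalg.qr(C @ Q)    # one step of orthogonal iteration
    return Q
```

A factor whose statistic has drifted little terminates immediately (zero extra work), while a rapidly changing factor runs more iterations, which is one way the approximation error per factor can be kept below a controlled threshold.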