Memory-Efficient 4-bit Preconditioned Stochastic Optimization

📅 2024-12-14
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Preconditioned stochastic optimizers such as Shampoo incur prohibitive memory overhead in large-scale neural network training because they store dense, non-diagonal preconditioning matrices. Method: We propose the first 4-bit quantized Shampoo preconditioner, quantizing the Cholesky factors of the preconditioners at ultra-low bit-width while preserving positive definiteness and symmetry. Crucially, we store the quantized factors and the error-feedback state in the lower and upper triangular parts of the same matrix, respectively, combining structured error feedback, 4-bit quantization, and non-diagonal preconditioning. Contributions/Results: The method retains convergence guarantees for both smooth and nonsmooth stochastic optimization while drastically reducing memory consumption, and empirical evaluation on large-scale model training demonstrates competitive performance and robustness. To our knowledge, this is the first work to establish rigorous convergence for sub-8-bit quantization in second-order preconditioned optimization.
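The triangular packing trick can be sketched in a few lines of numpy. This is a minimal illustration under assumed details (a symmetric 16-level 4-bit grid, a single global scale, and dropping the diagonal residual), not the paper's exact scheme:

```python
import numpy as np

def quantize_4bit(x, scale):
    """Round to the nearest of 16 uniform levels, i.e. {-8, ..., 7} * scale / 7."""
    levels = np.clip(np.round(x / scale * 7), -8, 7)
    return levels * scale / 7

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
P = A @ A.T + 6 * np.eye(6)        # a stand-in SPD preconditioner statistic
L = np.linalg.cholesky(P)          # lower-triangular factor, P = L @ L.T

scale = np.abs(L).max()
Lq = np.tril(quantize_4bit(L, scale))   # 4-bit lower-triangular factor
err = L - Lq                            # quantization residual (error-feedback state)

# Pack both states into one square buffer: the quantized factor occupies the
# lower triangle (with diagonal); the strictly-lower part of the residual is
# mirrored into the strictly-upper triangle. (The diagonal residual is dropped
# here for simplicity; a real implementation would track it separately.)
packed = Lq + np.tril(err, k=-1).T

# Unpacking recovers both states; the reconstruction Lq @ Lq.T is symmetric
# positive semidefinite by construction, which is the point of quantizing the
# Cholesky factor rather than the preconditioner itself.
Lq_rec = np.tril(packed)
err_rec = np.triu(packed, k=1).T
P_hat = Lq_rec @ Lq_rec.T
```

Storing the residual in the otherwise-unused upper triangle means error feedback costs no extra memory beyond the single packed matrix.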

📝 Abstract
Preconditioned stochastic optimization algorithms, exemplified by Shampoo, have demonstrated superior performance over first-order optimizers, providing both theoretical advantages in convergence rates and practical improvements in large-scale neural network training. However, they incur substantial memory overhead due to the storage demands of non-diagonal preconditioning matrices. To address this, we introduce 4-bit quantization for Shampoo's preconditioners, built on two key techniques. First, we apply Cholesky decomposition followed by quantization of the Cholesky factors, reducing memory usage by leveraging their lower-triangular structure while preserving symmetry and positive definiteness to minimize information loss. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. Second, we incorporate error feedback into the quantization process, efficiently storing the Cholesky factors and error states in the lower and upper triangular parts of the same matrix. Through extensive experiments, we demonstrate that combining Cholesky quantization with error feedback improves memory efficiency and algorithm performance in large-scale deep-learning tasks. Theoretically, we also provide convergence proofs for quantized Shampoo under both smooth and non-smooth stochastic optimization settings.
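To see where the memory overhead comes from, consider one simplified Shampoo-style step for a matrix-shaped parameter. The sketch below is a schematic single update with assumed names and a toy epsilon, not the paper's implementation; the point is that the two statistics `L` and `R` are dense m×m and n×n matrices, which 4-bit storage would shrink roughly 8× relative to fp32:

```python
import numpy as np

def matrix_power(M, p):
    """Matrix power of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 1e-12) ** p) @ V.T

rng = np.random.default_rng(1)
m, n = 8, 5
G = rng.standard_normal((m, n))      # gradient of an m x n parameter

eps = 1e-4
L = eps * np.eye(m)                  # left preconditioner statistic, m x m
R = eps * np.eye(n)                  # right preconditioner statistic, n x n

# One Shampoo-style step: accumulate second-moment statistics on each side,
# then precondition the gradient with the (-1/4) matrix powers.
L += G @ G.T
R += G.T @ G
update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
```

Unlike diagonal optimizers such as Adam, whose state is proportional to the parameter count m·n, these non-diagonal statistics add m² + n² entries per layer, which is what motivates quantizing them.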
Problem

Research questions and friction points this paper is trying to address.

Dense, non-diagonal preconditioning matrices impose substantial memory overhead in large-scale neural network training.
Quantizing second-order preconditioners to 4 bits risks destroying their symmetry and positive definiteness.
Prior to this work, no sub-8-bit quantization scheme for second-order preconditioned optimization came with rigorous convergence guarantees.
Innovation

Methods, ideas, or system contributions that make the work stand out.

4-bit quantization for Shampoo preconditioners
Cholesky decomposition with quantization
Error feedback in quantization process
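The error-feedback idea in the last bullet can be illustrated in isolation. In this toy sketch (illustrative names, same assumed symmetric 4-bit grid as above), the residual from each quantization step is added back before the next one, so quantization errors cancel over repeated steps instead of accumulating:

```python
import numpy as np

def quantize_4bit(x, scale):
    """Round to the nearest of 16 uniform levels, i.e. {-8, ..., 7} * scale / 7."""
    levels = np.clip(np.round(x / scale * 7), -8, 7)
    return levels * scale / 7

rng = np.random.default_rng(2)
target = rng.standard_normal(1000)   # a quantity re-quantized every step
scale = np.abs(target).max()

err = np.zeros_like(target)          # carried error-feedback state
acc_plain = np.zeros_like(target)    # accumulates plain quantization
acc_ef = np.zeros_like(target)       # accumulates error-feedback quantization
for _ in range(50):
    # Plain: quantize the fresh value alone; per-step bias repeats each step.
    acc_plain += quantize_4bit(target, scale)
    # Error feedback: fold the previous residual back in before quantizing.
    corrected = target + err
    q = quantize_4bit(corrected, scale)
    err = corrected - q
    acc_ef += q
```

The sums telescope: after 50 steps `acc_ef` differs from `50 * target` only by the final residual (at most one quantization step), while the plain accumulator's bias grows linearly in the number of steps. This is why error feedback lets ultra-low-bit state remain usable across many optimizer iterations.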