Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale optimization, diagonal adaptive methods (e.g., AdaGrad) fail to capture parameter correlations, while full-matrix approaches—though capable of approximating the Hessian and improving convergence—suffer from prohibitive computational and memory costs. To address this trade-off, we propose AdaGram, the first optimizer that integrates dynamic low-rank approximation, fast symmetric matrix decomposition, and matrix integration techniques to efficiently maintain a low-rank preconditioning matrix at each iteration, enabling full-matrix adaptive gradient updates. Its core innovation lies in approximating exact second-order information with *O*(*dr*) time and space complexity, where *r* ≪ *d* denotes the rank. Empirical results demonstrate that with only *r* = 1–5, AdaGram matches or surpasses the convergence speed of state-of-the-art diagonal adaptive optimizers across diverse standard benchmarks, achieving an unprecedented balance between representational capacity and scalability.

📝 Abstract
Adaptive gradient methods like AdaGrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits their ability to capture parameter correlations. Full-matrix adaptive methods, which approximate the exact Hessian, can model these correlations and may enable faster convergence. At the same time, their computational and memory costs are often prohibitive for large-scale models. To address this limitation, we propose AdaGram, an optimizer that enables efficient full-matrix adaptive gradient updates. To reduce memory and computational overhead, we use fast symmetric factorization to compute the preconditioned update direction at each iteration. Additionally, we maintain the low-rank structure of the preconditioner along the optimization trajectory using matrix integrator methods. Numerical experiments on standard machine learning tasks show that AdaGram converges faster than or matches diagonal adaptive optimizers when using rank-five and smaller approximations, demonstrating AdaGram's potential as a scalable solution for adaptive optimization in large models.
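To make the *O*(*dr*) cost concrete, a rank-*r* full-matrix preconditioned step can be sketched as below. This is a minimal illustration under assumed notation, not the paper's exact algorithm: the factor `U`, eigenvalues `s`, and damping `delta` are assumptions, with the preconditioner taken as G ≈ U diag(s) Uᵀ plus a damped identity on the orthogonal complement.

```python
import numpy as np

def precondition(g, U, s, delta=1e-2):
    """Apply a rank-r full-matrix preconditioner in O(d*r) time.

    Assumes G ~= U @ diag(s) @ U.T with U a (d, r) orthonormal basis.
    In-span directions are scaled by 1/sqrt(s + delta); directions
    outside span(U) see only the damping term, so no (d, d) matrix
    is ever formed.
    """
    c = U.T @ g                              # r coefficients: O(d*r)
    in_span = U @ (c / np.sqrt(s + delta))   # preconditioned in-span part
    residual = (g - U @ c) / np.sqrt(delta)  # damped out-of-span part
    return in_span + residual
```

For intuition, the same map written densely is U diag(1/√(s+δ)) Uᵀ + (I − UUᵀ)/√δ, which costs O(d²) to apply; the factored form above avoids that.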
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost of full-matrix adaptive optimization methods
Capturing parameter correlations beyond diagonal preconditioning limitations
Maintaining low-rank preconditioner structure during model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank approximation for full-matrix preconditioning
Fast symmetric factorization for update direction computation
Matrix integrator methods maintaining preconditioner structure
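The third point, keeping the preconditioner rank-*r* as new gradients arrive, could be illustrated by the sketch below. The paper uses dedicated matrix-integrator methods, so this decay-and-retruncate eigenvalue update is only an assumed stand-in; the names `U`, `s`, `beta`, and the rank-1 gradient update are hypothetical.

```python
import numpy as np

def update_low_rank(U, s, g, beta=0.9, rank=None):
    """Fold a new gradient into G ~= U @ diag(s) @ U.T and re-truncate.

    Decays the old factor by beta, adds the rank-1 term g g^T, and
    keeps only the top-`rank` eigenpairs. All dense linear algebra
    happens on an (r+1) x (r+1) matrix, so the cost is O(d*r^2),
    never O(d^2).
    """
    if rank is None:
        rank = U.shape[1]
    # Orthonormal basis for span([U, g]).
    B, _ = np.linalg.qr(np.column_stack([U, g]))
    # Represent beta * U diag(s) U^T + g g^T in that small basis.
    M = beta * (B.T @ U) @ np.diag(s) @ (U.T @ B) + np.outer(B.T @ g, B.T @ g)
    w, V = np.linalg.eigh(M)           # small symmetric eigenproblem
    idx = np.argsort(w)[::-1][:rank]   # keep the top-`rank` eigenpairs
    return B @ V[:, idx], w[idx]
```

Because the updated matrix has rank at most r+1, truncating its top r eigenpairs here recovers the best rank-r approximation exactly; a true matrix integrator instead tracks the low-rank factors continuously along the optimization trajectory.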