SubTrack your Grad: Gradient Subspace Tracking for Memory and Time Efficient Full-Parameter LLM Training

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitive memory and time overhead of full-parameter training for large language models, this paper proposes SubTrack-Grad, a dynamic gradient subspace tracking method grounded in Grassmannian geometry. By integrating estimation-error correction with recursive updates of previously identified subspaces, the approach enables efficient and stable subspace evolution within a low-rank gradient framework. It matches GaLore's memory savings while substantially improving training speed: wall-time reductions of up to 20.57% on GLUE tasks and up to 65% on SuperGLUE tasks, and a 49% lower wall-time overhead than GaLore when training a 3B-parameter model. The core contribution is the incorporation of an error-aware mechanism into Grassmann-manifold-based subspace tracking, jointly improving accuracy, convergence behavior, and computational efficiency.
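The idea described above can be sketched in a few lines. This is a minimal illustration assuming a GROUSE-style Grassmannian update with a QR retraction; the paper's exact update rule and error-correction term are not reproduced here, and all function names and the `step` parameter are illustrative:

```python
import numpy as np

def track_subspace(U, G, step=0.1):
    """One Grassmannian-style subspace update (illustrative sketch).

    U : (m, r) orthonormal basis of the current gradient subspace
    G : (m, n) newly observed gradient matrix
    """
    coeff = U.T @ G               # coordinates of G inside the current subspace
    resid = G - U @ coeff         # estimation error lying outside the subspace
    # Nudge the basis toward the residual directions, then re-orthonormalize
    # via QR (a standard retraction back onto the Grassmann manifold).
    U_new, _ = np.linalg.qr(U + step * resid @ coeff.T)
    return U_new

def low_rank_step(W, G, U, lr=1e-3):
    """SGD-like weight update that only needs the rank-r projected gradient."""
    R = U.T @ G                   # (r, n): what a low-rank optimizer would store
    return W - lr * (U @ R)       # project back to full space for the update
```

Tracking the subspace recursively, rather than recomputing it with a fresh SVD as GaLore does, is what avoids the periodic decomposition cost on large weight matrices.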

📝 Abstract
Training Large Language Models (LLMs) demands significant time and computational resources due to their large model sizes and optimizer states. To overcome these challenges, recent methods, such as BAdam, employ partial weight updates to enhance time and memory efficiency, though sometimes at the cost of performance. Others, like GaLore, focus on maintaining performance while optimizing memory usage through full parameter training, but may incur higher time complexity. By leveraging the low-rank structure of the gradient and the Grassmannian geometry, we propose SubTrack-Grad, a subspace tracking-based optimization method that efficiently tracks the evolving gradient subspace by incorporating estimation errors and previously identified subspaces. SubTrack-Grad delivers better or on-par results compared to GaLore, while significantly outperforming BAdam, which, despite being time-efficient, compromises performance. SubTrack-Grad reduces wall-time by up to 20.57% on GLUE tasks (15% average reduction) and up to 65% on SuperGLUE tasks (22% average reduction) compared to GaLore. Notably, for a 3B parameter model, GaLore incurred a substantial 157% increase in wall-time compared to full-rank training, whereas SubTrack-Grad exhibited a 31% increase, representing a 49% reduction in wall-time, while enjoying the same memory reductions as GaLore.
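The memory savings claimed for low-rank methods come from storing optimizer moments for a rank-r projected gradient instead of the full weight matrix. A back-of-envelope count, using a hypothetical layer size and rank (not figures from the paper):

```python
# Full-rank Adam keeps two moment tensors per (m, n) weight matrix; a rank-r
# projected optimizer (GaLore-style, which SubTrack-Grad also relies on) keeps
# moments for an (r, n) projected gradient plus the (m, r) projection basis.
def adam_state_elems(m, n):
    return 2 * m * n                    # first + second moments, full rank

def low_rank_state_elems(m, n, r):
    return 2 * r * n + m * r            # projected moments + basis U

m, n, r = 4096, 4096, 128               # hypothetical layer size and rank
full = adam_state_elems(m, n)
low = low_rank_state_elems(m, n, r)
print(f"state reduction: {1 - low / full:.1%}")  # → state reduction: 95.3%
```

With r much smaller than min(m, n), the optimizer state shrinks by roughly a factor of m / r per layer, which is why both GaLore and SubTrack-Grad achieve the same memory footprint; they differ in how the basis is maintained over time.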
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Efficient Training
Resource Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

SubTrack-Grad
Gradient Subspace Tracking
Large Model Training Efficiency