AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address the excessive memory overhead of optimizer states in large language model (LLM) fine-tuning, this paper rigorously proves, for the first time, that the rank of layer-wise gradient matrices asymptotically converges to one during training. Leveraging this insight, the authors propose AdaRankGrad: an online low-rank projection update mechanism that eliminates the need for full-rank warm-up. AdaRankGrad integrates adaptive gradient rank reduction, randomized singular value decomposition (SVD), and a modified Adam optimizer, compressing optimizer-state memory substantially while preserving full-parameter fine-tuning. Experiments demonstrate that AdaRankGrad achieves lower GPU memory consumption than state-of-the-art methods such as LoRA, while simultaneously improving performance on both pretraining and downstream fine-tuning tasks. The method is validated on both language and biological foundation models.

๐Ÿ“ Abstract
Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements due to the increasing size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which involves introducing a parallel trainable low-rank matrix to the fixed pre-trained weights at each layer. However, these methods often fall short compared to the full-rank weight training approach, as they restrict the parameter search to a low-rank subspace. This limitation can disrupt training dynamics and require a full-rank warm start to mitigate the impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases, and asymptotically approaches rank one. Leveraging this, our approach involves adaptively reducing the rank of the gradients during Adam optimization steps, using an efficient online-updating low-rank projections rule. We further present a randomized SVD scheme for efficiently finding the projection matrix. Our technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, significantly reducing overall memory requirements during training compared to state-of-the-art methods while improving model performance in both pretraining and fine-tuning. Finally, we provide a convergence analysis of our method and demonstrate its merits for training and fine-tuning language and biological foundation models.
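The core mechanics described in the abstract, projecting each layer's gradient onto a low-rank subspace found by randomized SVD and keeping the Adam moments only in that compact subspace, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class and function names, the fixed rank, and the periodic projection-refresh heuristic are all assumptions for demonstration; the paper adapts the rank over time and uses its own online update rule for the projection.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=5, seed=0):
    """Randomized range finder + SVD on the small sketch (Halko-style)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sketch the column space of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, rank + n_oversample))
    Q, _ = np.linalg.qr(A @ Omega)              # (m, rank + n_oversample)
    # Exact SVD of the much smaller projected matrix.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank, :]

class LowRankAdam:
    """Adam whose moment estimates live in a low-rank gradient subspace.

    The projection U (leading left singular vectors of a recent gradient)
    maps an (m, n) gradient to a (rank, n) representation, so the first and
    second moments cost O(rank * n) memory instead of O(m * n).
    Hyperparameters and the refresh schedule here are illustrative only.
    """
    def __init__(self, shape, rank, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 update_every=200):
        self.rank, self.lr, self.eps = rank, lr, eps
        self.b1, self.b2 = betas
        self.update_every = update_every
        self.m = np.zeros((rank, shape[1]))     # first moment, low-rank
        self.v = np.zeros((rank, shape[1]))     # second moment, low-rank
        self.U = None                           # current projection matrix
        self.t = 0

    def step(self, W, grad):
        self.t += 1
        # Periodically refresh the projection from the current gradient.
        if self.U is None or self.t % self.update_every == 1:
            self.U, _, _ = randomized_svd(grad, self.rank)
        g = self.U.T @ grad                     # project: (rank, n)
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        # Lift the low-rank Adam update back to full parameter space.
        return W - self.lr * self.U @ (m_hat / (np.sqrt(v_hat) + self.eps))
```

Because the moments are stored at shape (rank, n) rather than (m, n), memory for optimizer state shrinks in proportion to rank/m, which is the source of the savings the paper reports; the paper's contribution is to shrink that rank adaptively as training proceeds.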
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Optimization
Fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdaRankGrad
Gradient Simplification
Memory-Efficient Training