🤖 AI Summary
Adaptive optimizers in large language model (LLM) training incur high memory overhead, while SVD-based low-rank gradient projection suffers from expensive computation and storage costs. Method: This paper proposes an SVD-free, lightweight low-rank gradient projection method. Instead of layer-wise SVD, it employs a predefined orthogonal discrete cosine transform (DCT) basis and introduces an adaptive column-index selection mechanism aligned with gradient directions; only the sparse column indices, not the full projection matrix, are stored, and each projection requires a single matrix multiplication followed by a lightweight sorting step. Contribution/Results: Experiments demonstrate that the method matches the performance of SVD-based baselines in both pretraining and fine-tuning, while accelerating training and significantly reducing GPU memory consumption, which is particularly beneficial for the efficient optimization of billion- to trillion-parameter models.
📝 Abstract
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs from storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections onto lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Because the orthogonal bases are predefined, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store a full projection matrix for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our two-step strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.
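The two-step procedure described above can be sketched in a few lines of numpy. This is an illustrative approximation, not the paper's reference implementation: the function names (`dct_basis`, `select_columns`, `project`) and the specific alignment score (the energy of the gradient along each basis column) are assumptions made for the sketch; the paper's exact scoring rule may differ.

```python
import numpy as np

def dct_basis(m):
    # Orthonormal DCT-II basis for R^m; columns are the basis vectors.
    # Computed once at the start of training, since it is data-independent.
    n = np.arange(m)[:, None]   # sample index
    k = np.arange(m)[None, :]   # frequency index
    C = np.sqrt(2.0 / m) * np.cos(np.pi * (2 * n + 1) * k / (2 * m))
    C[:, 0] /= np.sqrt(2.0)     # k = 0 column needs a 1/sqrt(2) factor
    return C                    # satisfies C.T @ C == I

def select_columns(C, G, r):
    # Step 2: score every basis column by its alignment with the gradient G
    # (one matrix multiplication), then keep the r best columns (a sort).
    scores = np.linalg.norm(C.T @ G, axis=1)
    idx = np.sort(np.argsort(scores)[::-1][:r])
    return idx                  # only these indices are stored per layer

def project(C, G, idx):
    # Project the (m x n) gradient into the selected r-dimensional subspace.
    return C[:, idx].T @ G      # shape (r, n)
```

Because the full basis is orthonormal, projecting onto all m columns reconstructs the gradient exactly; selecting the r best-aligned columns keeps the dominant part of the gradient while storing only r integer indices instead of an m-by-r projection matrix.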