Low-rank Momentum Factorization for Memory Efficient Training

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high memory overhead of stateful optimizers (e.g., AdamW) in large language model (LLM) fine-tuning, this paper proposes MoFaSGD (Momentum Factorized SGD)—a memory-efficient optimization method based on dynamic low-rank momentum factorization. Methodologically, it maintains an online low-rank singular value decomposition (SVD) approximation of the momentum matrix, integrated with spectrally normalized updates and an adaptive subspace adjustment mechanism—thereby avoiding fixed-subspace bias and costly offline resampling (e.g., full-matrix SVDs). Theoretically, convergence is guaranteed at an optimal rate under standard non-convex stochastic optimization assumptions. Empirically, MoFaSGD achieves performance on LLM alignment benchmarks competitive with state-of-the-art low-rank optimization methods, while its GPU memory footprint is comparable to LoRA's and substantially lower than AdamW's. The approach thus offers a favorable trade-off among memory efficiency, training stability, and performance for LLM fine-tuning.

📝 Abstract
Fine-tuning large foundation models presents significant memory challenges due to stateful optimizers like AdamW, often requiring several times more GPU memory than inference. While memory-efficient methods like parameter-efficient fine-tuning (e.g., LoRA) and optimizer state compression exist, recent approaches like GaLore bridge these by using low-rank gradient projections and subspace moment accumulation. However, such methods may struggle with fixed subspaces or computationally costly offline resampling (e.g., requiring full-matrix SVDs). We propose Momentum Factorized SGD (MoFaSGD), which maintains a dynamically updated low-rank SVD representation of the first-order momentum, closely approximating its full-rank counterpart throughout training. This factorization enables a memory-efficient fine-tuning method that adaptively updates the optimization subspace at each iteration. Crucially, MoFaSGD leverages the computed low-rank momentum factors to perform efficient spectrally normalized updates, offering an alternative to subspace moment accumulation. We establish theoretical convergence guarantees for MoFaSGD, proving it achieves an optimal rate for non-convex stochastic optimization under standard assumptions. Empirically, we demonstrate MoFaSGD's effectiveness on large language model alignment benchmarks, achieving a competitive trade-off between memory reduction (comparable to LoRA) and performance compared to state-of-the-art low-rank optimization methods. Our implementation is available at https://github.com/pmahdavi/MoFaSGD.
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient fine-tuning of large foundation models
Dynamic low-rank SVD for optimizer state compression
Balancing memory reduction and training performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically updated low-rank SVD representation of momentum
Adaptive update of the optimization subspace at each iteration
Efficient spectrally normalized update rule from the low-rank factors
Pouria Mahdavinia
Department of Computer Science and Engineering, The Pennsylvania State University
Mehrdad Mahdavi
Hartz Family Associate Professor of Computer Science @ Penn State
Machine Learning · Optimization Theory · Learning Theory