On the Duality between Gradient Transformations and Adapters

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets the heavy memory cost of storing gradients and optimizer states when training large models. We propose a low-dimensional optimization framework based on linear gradient transformations: gradients are projected onto a low-dimensional subspace, the update is computed there, and the result is mapped back to the original parameter space. Theoretically, we establish a rigorous duality between gradient transformations and additive linear adapter reparameterizations (e.g., LoRA), proving that Kronecker-structured gradient transformations are equivalent to one-sided LoRA and thereby unifying GaLore, LoRA, and related methods under one framework. Methodologically, we combine linear projection, Kronecker decomposition, and low-rank optimization into a paradigm that jointly pursues memory efficiency and update quality. Experiments demonstrate substantial reductions in gradient-storage and optimizer-state memory without compromising update fidelity, providing both a unified theoretical foundation and a practical recipe for training under memory constraints.
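The projected-update loop described in the summary can be sketched in a few lines. This is not the paper's code: the toy dimensions, the fixed orthonormal projection `P`, and the use of SGD with momentum are all illustrative assumptions; the point is that the optimizer state (`M`) lives in the small `r × n` space while the weights stay full-size.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4                # full dims and projection rank (r << m)
W = rng.standard_normal((m, n)) * 0.01
P, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthonormal projection, m x r

M = np.zeros((r, n))               # momentum buffer lives in the r x n subspace
lr, beta = 0.1, 0.9

def step(W, G, M):
    """One SGD-with-momentum step computed in the projected space."""
    g_low = P.T @ G                # project the gradient: m x n -> r x n
    M = beta * M + g_low           # optimizer state stays low-dimensional
    W = W - lr * (P @ M)           # map the update back via the transpose map
    return W, M

G = rng.standard_normal((m, n))    # stand-in for a backprop gradient
W, M = step(W, G, M)
```

The memory saving comes from `M`: a full-space momentum buffer would need `m * n` floats, the projected one only `r * n`.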

📝 Abstract
We study memory-efficient optimization of neural networks with linear gradient transformations, where the gradients are linearly mapped to a lower-dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower-dimensional space and then mapping the result back to the original parameter space via the linear map's transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then optimizing only the adapter's parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
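The duality claimed in the abstract can be checked numerically for the simplest case, plain SGD with a fixed projection. The snippet below is a minimal sketch, not the paper's construction: the toy quadratic loss, dimensions, and learning rate are assumptions. View 1 applies the gradient transformation (project with `Pᵀ`, step, map back with `P`); view 2 reparameterizes `W = W0 + P @ B` and takes a gradient step on the adapter `B` only. The two resulting weight matrices coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, lr = 8, 5, 3, 0.05
W0 = rng.standard_normal((m, n))
P, _ = np.linalg.qr(rng.standard_normal((m, r)))  # fixed linear map, m x r

def grad(W):
    # Toy quadratic loss L(W) = 0.5 * ||W||_F^2, so dL/dW = W
    return W

# View 1: gradient transformation -- project, update, map back by the transpose
W1 = W0 - lr * P @ (P.T @ grad(W0))

# View 2: additive linear adapter W = W0 + P @ B, optimizing only B
B = np.zeros((r, n))
G_B = P.T @ grad(W0 + P @ B)       # chain rule: dL/dB = P^T dL/dW
B = B - lr * G_B
W2 = W0 + P @ B

assert np.allclose(W1, W2)         # the two updates coincide
```

For stateful optimizers such as Adam, the equivalence requires carrying the optimizer state through the same transformation rather than a single SGD step, which is the regime the paper analyzes.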
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient neural network optimization
Linear gradient transformations for reduced dimensionality
Equivalence between gradient transformations and adapter reparameterizations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear gradient transformation reduces memory
Kronecker-factored transformation links GaLore and LoRA
Adapter-based reparameterization enhances training efficiency
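The Kronecker-factored link between GaLore and one-sided LoRA noted above rests on the standard identity `(A ⊗ B) vec(X) = vec(B X Aᵀ)` (column-major `vec`). The sketch below only verifies that identity and its one-sided special case numerically; the dimensions and random factors are illustrative assumptions, not values from the paper. Setting one Kronecker factor to the identity collapses the transform to a left projection `Pᵀ G` of the gradient matrix, the GaLore-style map that corresponds to a one-sided adapter.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2
G = rng.standard_normal((m, n))          # a full-size gradient matrix
P = rng.standard_normal((m, r))          # projection factor

# General identity: (A ⊗ B) vec(X) = vec(B X Aᵀ), with column-major vec
A = rng.standard_normal((3, n))
lhs = np.kron(A, P.T) @ G.reshape(-1, order="F")
rhs = (P.T @ G @ A.T).reshape(-1, order="F")
assert np.allclose(lhs, rhs)

# One-sided special case A = I: the Kronecker transform reduces to a
# left projection of the gradient matrix, i.e. the GaLore-style map Pᵀ G
one_sided = np.kron(np.eye(n), P.T) @ G.reshape(-1, order="F")
assert np.allclose(one_sided, (P.T @ G).reshape(-1, order="F"))
```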