🤖 AI Summary
This work identifies the computation of activation gradients during backpropagation as the critical compute bottleneck of LoRA fine-tuning, the first such characterization in the literature. To address it, we propose a lightweight fine-tuning framework that balances computational and memory efficiency through three components: (1) a sparsified approximate matrix multiplication that reduces the complexity of gradient computation; (2) a Double-LoRA dual-path gradient estimation mechanism that jointly suppresses the approximation error; and (3) a low-rank parameter update scheme. We theoretically establish an $\mathcal{O}(1/\sqrt{T})$ convergence rate. Experiments show that, compared to standard LoRA, our method substantially reduces FLOPs and training time, achieving up to 47% lower computational cost and 2.1× faster training while preserving near-identical model performance across multiple benchmark tasks.
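To make the bottleneck concrete, the sketch below traces the cost of a single LoRA layer's backward pass: the adapter-gradient paths are cheap thanks to the rank-$r$ factors, but the activation gradient still flows through the frozen dense weight. This is an illustrative reconstruction with made-up dimensions, not code from the paper.

```python
# Illustrative LoRA backward pass with explicit shapes (dimensions are made up).
# Forward: y = x @ W + (x @ A) @ B, with W frozen and A, B the rank-r adapters.
import torch

d, r, batch = 4096, 16, 8
W = torch.randn(d, d)                          # frozen pretrained weight
A, B = torch.randn(d, r), torch.randn(r, d)    # LoRA factors
x = torch.randn(batch, d)                      # layer input (activation)
dY = torch.randn(batch, d)                     # gradient w.r.t. the layer output

# Adapter gradients: only O(batch * d * r) FLOPs thanks to the low rank.
dB = (x @ A).t() @ dY
dA = x.t() @ (dY @ B.t())

# Activation gradient: a dense O(batch * d^2) multiplication through W,
# the step the summary identifies as the compute bottleneck.
dX = dY @ W.t() + (dY @ B.t()) @ A.t()
```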
📝 Abstract
Large Language Models (LLMs) demonstrate exceptional performance across a wide range of tasks but demand substantial computational resources even for fine-tuning. Although Low-Rank Adaptation (LoRA) significantly alleviates memory consumption during fine-tuning, it does little to reduce computational cost. This paper identifies the computation of activation gradients as the primary bottleneck in LoRA's backward propagation and introduces the Computation-Efficient LoRA (CE-LoRA) algorithm, which improves computational efficiency while preserving memory efficiency. CE-LoRA leverages two key techniques: Approximated Matrix Multiplication, which replaces dense multiplications of large, complete matrices with sparse multiplications involving only critical rows and columns, and the Double-LoRA technique, which reduces error propagation in activation gradients. Theoretically, CE-LoRA converges at the same rate as LoRA, $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of iterations. Empirical evaluations confirm that CE-LoRA significantly reduces computational costs compared to LoRA without notable performance degradation.
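As a rough illustration of the Approximated Matrix Multiplication idea described above, the sketch below approximates a dense product by keeping only the most "critical" column/row pairs. The norm-product importance score and the choice of `k` are assumptions made for illustration; the paper's actual selection criterion may differ.

```python
# Minimal sketch of approximated matrix multiplication: keep only the k
# column/row pairs with the largest importance scores (assumed here to be
# the product of column and row norms; this criterion is an assumption).
import torch

def approx_matmul(A: torch.Tensor, B: torch.Tensor, k: int) -> torch.Tensor:
    scores = A.norm(dim=0) * B.norm(dim=1)   # importance of each inner-dimension index
    idx = torch.topk(scores, k).indices      # indices of the k most critical pairs
    return A[:, idx] @ B[idx, :]             # reduced product: O(m * k * p) FLOPs

# The exact product costs O(m * n * p) FLOPs; with k << n the approximation is much cheaper.
A, B = torch.randn(512, 4096), torch.randn(4096, 512)
C_approx = approx_matmul(A, B, k=256)
```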