CE-LoRA: Computation-Efficient LoRA Fine-Tuning for Language Models

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of activation-gradient computation during backpropagation in LoRA fine-tuning, this work identifies that step as the critical compute bottleneck, the first such characterization in the literature. We propose a lightweight fine-tuning framework that balances computational and memory efficiency: (1) a sparsified approximate matrix multiplication that reduces the complexity of gradient computation; (2) a Double-LoRA dual-path gradient-estimation mechanism that suppresses the resulting approximation error; and (3) a low-rank parameter-update scheme. We theoretically establish an $\mathcal{O}(1/\sqrt{T})$ convergence rate. Experiments demonstrate that, compared to standard LoRA, our method significantly reduces FLOPs and training time, achieving up to 47% lower computational cost and 2.1× faster training, while preserving near-identical model performance across multiple benchmark tasks.

📝 Abstract
Large Language Models (LLMs) demonstrate exceptional performance across various tasks but demand substantial computational resources even for fine-tuning. Although Low-Rank Adaptation (LoRA) significantly alleviates memory consumption during fine-tuning, its impact on computational cost reduction is limited. This paper identifies the computation of activation gradients as the primary bottleneck in LoRA's backward propagation and introduces the Computation-Efficient LoRA (CE-LoRA) algorithm, which enhances computational efficiency while preserving memory efficiency. CE-LoRA leverages two key techniques: Approximated Matrix Multiplication, which replaces dense multiplications of large, complete matrices with sparse multiplications involving only critical rows and columns, and the Double-LoRA technique, which reduces error propagation in activation gradients. Theoretically, CE-LoRA converges at the same rate as LoRA, $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of iterations. Empirical evaluations confirm that CE-LoRA significantly reduces computational costs compared to LoRA without notable performance degradation.
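The Approximated Matrix Multiplication idea described in the abstract can be illustrated with a minimal NumPy sketch. It exploits the fact that $AB = \sum_i A_{:,i} B_{i,:}$, a sum of rank-1 terms, and keeps only the $k$ terms deemed most important. Note this is an illustrative assumption, not the paper's exact algorithm: the selection rule here (scoring each column/row pair by the product of its norms) and the function name `approx_matmul` are choices made for the sketch.

```python
import numpy as np

def approx_matmul(A: np.ndarray, B: np.ndarray, k: int) -> np.ndarray:
    """Sparsified approximate matrix multiplication.

    A @ B equals the sum over i of the outer products A[:, i] @ B[i, :].
    This sketch keeps only the k column/row pairs with the largest
    combined norms and drops the rest, trading accuracy for FLOPs.
    """
    # Score each rank-1 term by ||A[:, i]||_2 * ||B[i, :]||_2 (illustrative criterion)
    scores = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    top = np.argsort(scores)[-k:]  # indices of the k highest-scoring terms
    # Multiply only the selected columns of A with the matching rows of B
    return A[:, top] @ B[top, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
B = rng.standard_normal((256, 32))

exact = A @ B
approx = approx_matmul(A, B, k=128)  # roughly half the FLOPs of the dense product
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```

Keeping half the terms halves the cost of this product; in CE-LoRA the residual error this introduces in the activation gradients is what the Double-LoRA mechanism is designed to compensate for.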
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Fine-tuning
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

CE-LoRA
Efficient Fine-tuning
Matrix Multiplication Approximation
Guanduo Chen (Moonshot AI)
Yutong He (Peking University)
Yipeng Hu (Peking University)
Kun Yuan (Peking University)
Binhang Yuan (HKUST)