Dynamic Low-Rank Sparse Adaptation for Large Language Models

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the substantial performance degradation of large language models (LLMs) under sparse deployment, the difficulty of integrating LoRA weights into sparse models post-training, and the insufficient performance recovery at high sparsity ratios, this paper proposes Dynamic Low-Rank Sparse Adaptation (LoSA). LoSA jointly optimizes low-rank adaptation and sparsity within a unified framework: it allocates layer-wise sparsity rates based on Representation Mutual Information (RMI) and dynamically adjusts each layer's LoRA rank according to its reconstruction error, providing fine-grained accuracy compensation. Crucially, LoSA enables efficient fine-tuning of sparse LLMs and seamless post-training weight integration without increasing inference latency. On sparse LLaMA-2-7B, LoSA reduces perplexity by 68.73 and improves zero-shot accuracy by 16.32%, achieves 2.60× CPU and 2.23× GPU inference speedups, and completes fine-tuning on a single NVIDIA A100 80GB GPU in just 45 minutes.
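The key mechanism in the summary above is that the LoRA update is sparsified with the same mask as the pruned weights during fine-tuning, so the adapter can be folded into the sparse matrix afterwards at no extra inference cost. A minimal NumPy sketch of that merge step (function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def merge_sparse_lora(w_sparse, lora_a, lora_b):
    """Merge a LoRA update into a pruned weight matrix while
    preserving its sparsity pattern (illustrative sketch).

    w_sparse : (d_out, d_in) pruned weights (zeros = pruned entries)
    lora_a   : (r, d_in) low-rank factor A
    lora_b   : (d_out, r) low-rank factor B
    """
    mask = (w_sparse != 0).astype(w_sparse.dtype)  # existing sparsity mask
    delta = lora_b @ lora_a                        # dense low-rank update B·A
    # Sparsify the LoRA outcome with the same mask, so the merged matrix
    # is exactly as sparse as before -- no additional inference latency.
    return w_sparse + mask * delta

# Toy example: ~50% sparsity, rank-2 adapter
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w[rng.random((8, 8)) < 0.5] = 0.0
a, b = rng.normal(size=(2, 8)), rng.normal(size=(8, 2))
merged = merge_sparse_lora(w, a, b)
assert np.all(merged[w == 0] == 0)  # pruned positions stay pruned
```

Because the update never touches pruned positions, the merged matrix keeps the hardware-friendly sparsity pattern that delivers the reported CPU/GPU speedups.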

📝 Abstract
Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, but it has two shortcomings: 1) the inability to integrate LoRA weights into sparse LLMs post-training, and 2) insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Predicated on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning for each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32%, achieving a 2.60× speedup on CPU and 2.23× speedup on GPU, requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
Problem

Research questions and friction points this paper is trying to address.

Enhance sparse LLM performance without increasing inference latency.
Dynamically integrate LoRA weights into sparse LLMs post-training.
Efficiently determine layer-wise sparsity rates using RMI indicators.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Low-Rank Sparse Adaptation (LoSA)
Integrates low-rank adaptation into LLM sparsity within a unified framework
Uses Representation Mutual Information to guide layer-wise sparsity and rank allocation
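The rank-allocation idea behind the contributions above can be sketched simply: layers whose sparsified outputs deviate more from the dense model (larger reconstruction error) receive a larger share of the LoRA rank budget. The proportional rule below is a hypothetical illustration, assuming per-layer errors are already measured; LoSA's exact allocation formula may differ.

```python
import numpy as np

def allocate_ranks(recon_errors, total_rank_budget, r_min=1):
    """Distribute a total LoRA rank budget across layers in proportion
    to each layer's sparsification reconstruction error (illustrative
    sketch, not the paper's exact rule)."""
    errors = np.asarray(recon_errors, dtype=float)
    shares = errors / errors.sum()          # normalized error shares
    # Floor to integers, but guarantee every layer at least r_min rank
    ranks = np.maximum(r_min, np.floor(shares * total_rank_budget)).astype(int)
    return ranks

# Layers with larger dense-vs-sparse output discrepancy get higher ranks
print(allocate_ranks([0.5, 2.0, 1.0, 0.5], total_rank_budget=32))
# → [ 4 16  8  4]
```

This captures the paper's intuition of "allocating an appropriate fine-tuning for each layer": rank (fine-tuning capacity) is spent where the sparse model diverges most from the dense one.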
Weizhong Huang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Yuxin Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning · Network Compression · Neural Architecture Search · AutoML
Yang Liu
Huawei Technologies
Jing Lin
Huawei Technologies
Yiwu Yao
Peking University
Artificial Intelligence
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.; Institute of Artificial Intelligence, Xiamen University