🤖 AI Summary
Sparse deployment of large language models (LLMs) causes substantial performance degradation, and fine-tuning sparse models with LoRA has two drawbacks: the LoRA weights cannot be merged into the sparse model after training, and recovery is insufficient at high sparsity levels. This paper proposes Dynamic Low-Rank Sparse Adaptation (LoSA), which combines low-rank adaptation and sparsity in a single framework. LoSA sparsifies the LoRA output according to the corresponding sparse weights so that it can be merged after training, allocates layer-wise sparsity ratios based on representation mutual information (RMI), and adjusts the LoRA rank of each layer according to its reconstruction error, giving per-layer accuracy compensation. As a result, LoSA supports efficient fine-tuning and seamless weight integration for sparse LLMs without increasing inference latency. On sparse LLaMA-2-7B, LoSA reduces perplexity by 68.73 and improves zero-shot accuracy by 16.32%. It achieves a 2.60× speedup on CPU and a 2.23× speedup on GPU, and requires only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU.
📝 Abstract
Although network sparsity is effective in alleviating the deployment strain of Large Language Models (LLMs), it causes significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune sparse LLMs is an intuitive way to counter this, but it has two shortcomings: 1) the LoRA weights cannot be integrated into the sparse LLM after training, and 2) performance recovery is insufficient at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, which guarantees that the LoRA module can be integrated into the sparse LLM after training. In addition, LoSA uses Representation Mutual Information (RMI) as an indicator of layer importance, which allows it to determine the layer-wise sparsity rates efficiently during fine-tuning. Building on this, LoSA adjusts the rank of each LoRA module according to the variability in layer-wise reconstruction errors, allocating an appropriate amount of fine-tuning capacity to each layer to reduce the output discrepancy between the dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the performance of sparse LLMs within a few hours, without introducing any additional inference overhead. For example, LoSA reduces the perplexity of sparse LLaMA-2-7B by 68.73 and increases zero-shot accuracy by 16.32%, achieving a 2.60× speedup on CPU and a 2.23× speedup on GPU, while requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
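The claim that sparsifying the LoRA outcomes makes them mergeable can be illustrated with a toy example. The sketch below is not taken from the LoSA repository: the function names (`matmul`, `merge_masked_lora`, `sparsity`) and the tiny matrices are hypothetical, and it assumes the mask is simply the nonzero pattern of the pruned weight. It shows only the general idea that a dense low-rank update fills in the pruned entries when merged naively, whereas a masked update leaves the sparsity pattern unchanged.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_masked_lora(w_sparse, lora_b, lora_a):
    """Merge a low-rank update B @ A into a pruned weight matrix.

    Assumption of this sketch: the mask is the nonzero pattern of w_sparse,
    so the update is applied only where the pruned weight is nonzero.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + d if w != 0.0 else 0.0 for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_sparse, delta)
    ]


def sparsity(w):
    """Fraction of entries that are exactly zero."""
    flat = [x for row in w for x in row]
    return sum(1 for x in flat if x == 0.0) / len(flat)


# Toy pruned weight (2 x 4) with half of its entries removed.
w_sparse = [[0.5, 0.0, -1.2, 0.0],
            [0.0, 0.8, 0.0, 0.3]]
# Toy rank-1 LoRA factors: B is 2 x 1, A is 1 x 4.
lora_b = [[0.1], [-0.2]]
lora_a = [[1.0, 2.0, 3.0, 4.0]]

# Naive merge: the dense product B @ A fills in the pruned positions.
delta = matmul(lora_b, lora_a)
w_naive = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(w_sparse, delta)]

# Masked merge: pruned positions stay at zero.
w_merged = merge_masked_lora(w_sparse, lora_b, lora_a)

print("sparsity before merge:", sparsity(w_sparse))
print("sparsity after naive merge:", sparsity(w_naive))
print("sparsity after masked merge:", sparsity(w_merged))
```

In this toy case the naive merge leaves no zero entries, while the masked merge keeps the original 50% sparsity, so the merged matrix can still be served by the same sparse kernels. Applying the same mask in the forward pass during fine-tuning, as the abstract describes, means the model is trained with exactly the update that will later be merged.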