Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

📅 2026-01-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high computational and memory costs of fine-tuning large language models, which hinder deployment in resource-constrained settings even with existing low-rank methods, since those methods still rely on dense weights. The authors propose SALR, which unifies static sparse pruning and low-rank adaptation within a mean-squared-error framework. They prove that statically pruning only the frozen base weights minimizes the pruning-error bound, and they recover the discarded residual information via a truncated singular value decomposition to construct low-rank adapters. Combined with multi-adapter fusion, bitmap encoding, and GEMM optimization, SALR achieves genuine model compression and acceleration. Experiments show that SALR matches LoRA's performance on GSM8K and MMLU while reducing model size by 50% and accelerating inference by 1.7×.
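The summary's two-step construction can be sketched in NumPy. The function name `salr_decompose`, the quantile-based magnitude-pruning rule, and the adapter factor shapes are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def salr_decompose(W, sparsity=0.5, rank=8):
    """Sketch of the SALR idea (names and details are assumptions):
    1) statically magnitude-prune the frozen base weight W to a sparse S,
    2) recover the discarded residual W - S with a rank-r truncated SVD,
       giving low-rank adapter factors (B, A) so that W ~ S + B @ A."""
    # Step 1: static magnitude pruning of the frozen base weights.
    threshold = np.quantile(np.abs(W), sparsity)
    S = np.where(np.abs(W) >= threshold, W, 0.0)

    # Step 2: truncated SVD of the pruning residual.
    U, sigma, Vt = np.linalg.svd(W - S, full_matrices=False)
    B = U[:, :rank] * sigma[:rank]   # (d, r), columns scaled by singular values
    A = Vt[:rank, :]                 # (r, k)
    return S, B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
S, B, A = salr_decompose(W, sparsity=0.5, rank=8)

# The adapter strictly reduces the approximation error of pruning alone.
err_prune = np.linalg.norm(W - S)
err_salr = np.linalg.norm(W - (S + B @ A))
assert err_salr < err_prune
```

Because the truncated SVD is the best rank-r approximation of the residual in Frobenius norm, the combined sparse-plus-low-rank reconstruction can never be worse than pruning alone, which is the intuition behind the paper's MSE-reduction claim.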

πŸ“ Abstract
Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-Rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA's performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of $(1 - r/\min(d,k))$. To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50\% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by $2\times$, and delivers up to a $1.7\times$ inference speedup.
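The bitmap-based encoding mentioned in the abstract can be sketched as follows. The layout (a 1-bit mask per entry plus the packed fp16 nonzeros, giving roughly 2x compression at 50% sparsity) and the function names are assumptions for illustration; the decode step stands in for stage one of the paper's two-stage decoding + GEMM pipeline:

```python
import numpy as np

def bitmap_encode(S):
    """Encode a sparse matrix as (bitmap, nonzero values, shape)."""
    mask = S != 0.0
    bitmap = np.packbits(mask.ravel())    # 1 bit of position info per weight
    values = S[mask].astype(np.float16)   # packed nonzero values only
    return bitmap, values, S.shape

def bitmap_decode(bitmap, values, shape):
    """Rebuild the dense matrix; on real hardware this stage would be
    pipelined with the GEMM that consumes the decoded tile."""
    n = shape[0] * shape[1]
    mask = np.unpackbits(bitmap)[:n].astype(bool)
    dense = np.zeros(n, dtype=np.float16)
    dense[mask] = values
    return dense.reshape(shape)

rng = np.random.default_rng(0)
S = rng.standard_normal((32, 32)).astype(np.float16)
S[np.abs(S) < np.quantile(np.abs(S), 0.5)] = 0.0   # ~50% sparsity

bitmap, values, shape = bitmap_encode(S)
assert np.array_equal(bitmap_decode(bitmap, values, shape), S)
```

At 50% sparsity in fp16, the encoding costs 1 bit of mask plus 8 bits of value per original 16-bit weight on average, which is where a roughly $2\times$ size reduction comes from.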
Problem

Research questions and friction points this paper is trying to address.

efficient fine-tuning
large language models
sparsity
low-rank adaptation
model compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Adaptation
Sparsity-Aware Pruning
Truncated SVD
Model Compression
Efficient Fine-Tuning