HASSLE-free: A Unified Framework for Sparse plus Low-Rank Matrix Decomposition for LLMs

📅 2025-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational and resource costs of serving large language models (LLMs) at inference time, this paper proposes a unified sparse-plus-low-rank weight decomposition framework. Methodologically, it introduces a local layer-wise reconstruction error minimization objective for this decomposition and solves the exact, non-relaxed optimization problem that prior work only addressed through relaxation. The framework jointly supports 2:4 semi-structured sparsity and a rank-controllable low-rank component (e.g., obtained via truncated SVD), backed by efficient and scalable optimization methods. On Llama3-8B with a 2:4-sparse plus rank-64 decomposition, it reduces WikiText-2 test perplexity by 12% and shrinks the gap to the dense model, averaged over eight popular zero-shot tasks, by 15% relative to existing compression methods. The core contributions are (i) a theoretically rigorous joint optimization formulation and (ii) a hardware-friendly, efficiently deployable compression paradigm.

📝 Abstract
The impressive capabilities of large foundation models come at the cost of substantial computing resources to serve them. Compressing these pre-trained models is of practical interest, as it can democratize deploying them to the machine learning community at large by lowering the costs associated with inference. A promising compression scheme is to decompose foundation models' dense weights into a sum of sparse plus low-rank matrices. In this paper, we design a unified framework coined HASSLE-free for (semi-structured) sparse plus low-rank matrix decomposition of foundation models. Our framework introduces the local layer-wise reconstruction error objective for this decomposition; we demonstrate that prior work solves a relaxation of this optimization problem; and we provide efficient and scalable methods to minimize the exact introduced optimization problem. HASSLE-free substantially outperforms state-of-the-art methods in terms of the introduced objective and a wide range of LLM evaluation benchmarks. For the Llama3-8B model with a 2:4 sparsity component plus a 64-rank component decomposition, a compression scheme for which recent work shows important inference acceleration on GPUs, HASSLE-free reduces the test perplexity by 12% for the WikiText-2 dataset and reduces the gap (compared to the dense model) of the average of eight popular zero-shot tasks by 15% compared to existing methods.
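To make the decomposition concrete, here is a minimal NumPy sketch of the general sparse-plus-low-rank scheme the abstract describes: a weight matrix W is approximated as S + L, where S has a 2:4 semi-structured pattern (at most 2 nonzeros per group of 4 along each row) and L has bounded rank, obtained by truncated SVD. Note this is a generic alternating-minimization sketch over the simple proxy objective ||W - (S + L)||_F, not the paper's actual HASSLE-free algorithm, which optimizes a layer-wise reconstruction error; the function names and the iteration scheme are illustrative assumptions.

```python
import numpy as np

def two_four_sparsify(M):
    """Keep the 2 largest-magnitude entries in each group of 4 along every row
    (2:4 semi-structured sparsity); zero out the rest."""
    rows, cols = M.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = M.reshape(rows, cols // 4, 4)
    # Indices of the top-2 magnitudes within each group of 4.
    top2 = np.argsort(-np.abs(groups), axis=-1)[..., :2]
    out = np.zeros_like(groups)
    np.put_along_axis(out, top2, np.take_along_axis(groups, top2, axis=-1), axis=-1)
    return out.reshape(rows, cols)

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def sparse_plus_low_rank(W, rank=64, iters=10):
    """Alternating minimization of ||W - (S + L)||_F:
    sparsify the residual W - L, then low-rank-approximate W - S."""
    L = np.zeros_like(W)
    for _ in range(iters):
        S = two_four_sparsify(W - L)   # 2:4-sparse update given L
        L = truncated_svd(W - S, rank) # low-rank update given S
    return S, L
```

Each alternating step cannot increase the Frobenius objective, so the combined S + L is never a worse approximation of W than 2:4 pruning alone; HASSLE-free's contribution, per the abstract, is solving the exact layer-wise reconstruction objective rather than a relaxation of it.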
Problem

Research questions and friction points this paper is trying to address.

Cost Reduction
Large Language Models
Performance Maintenance
Innovation

Methods, ideas, or system contributions that make the work stand out.

HASSLE-free framework
Large pre-trained language model compression
Decomposition accuracy optimization