🤖 AI Summary
To address the high computational cost of serving large language models (LLMs), this paper proposes HASSLE-free, a unified framework for decomposing the dense weights of pre-trained models into a sum of a (semi-structured) sparse matrix and a low-rank matrix. Methodologically, it introduces a local layer-wise reconstruction error objective for this decomposition, shows that prior work solves only a relaxation of the resulting optimization problem, and provides efficient, scalable methods that minimize the exact objective. The framework supports 2:4 semi-structured sparsity combined with a rank-controllable low-rank component (e.g., via truncated SVD), a configuration for which recent work demonstrates meaningful inference acceleration on GPUs. On Llama3-8B with a 2:4 sparse plus rank-64 decomposition, HASSLE-free reduces WikiText-2 test perplexity by 12% and narrows the gap to the dense model on the average of eight zero-shot tasks by 15% relative to existing methods. The core contributions are (i) a theoretically rigorous formulation that optimizes the exact, non-relaxed objective and (ii) a hardware-friendly, efficiently deployable compression paradigm.
📝 Abstract
The impressive capabilities of large foundation models come at the cost of the substantial computing resources needed to serve them. Compressing these pre-trained models is of practical interest, as it can democratize their deployment across the machine learning community by lowering the costs associated with inference. A promising compression scheme is to decompose a foundation model's dense weights into a sum of sparse plus low-rank matrices. In this paper, we design a unified framework, coined HASSLE-free, for (semi-structured) sparse plus low-rank matrix decomposition of foundation models. Our framework introduces the local layer-wise reconstruction error objective for this decomposition; we demonstrate that prior work solves only a relaxation of this optimization problem, and we provide efficient and scalable methods to minimize the exact objective. HASSLE-free substantially outperforms state-of-the-art methods both on the introduced objective and on a wide range of LLM evaluation benchmarks. For the Llama3-8B model with a 2:4 sparsity component plus a rank-64 component, a compression scheme for which recent work shows important inference acceleration on GPUs, HASSLE-free reduces test perplexity on WikiText-2 by 12% and reduces the gap (relative to the dense model) on the average of eight popular zero-shot tasks by 15% compared to existing methods.
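To make the decomposition concrete, the sketch below illustrates one simple way to split a weight matrix into a 2:4 sparse part plus a low-rank part by alternating magnitude pruning and truncated SVD. This is a hedged illustration, not the paper's method: HASSLE-free minimizes a local layer-wise reconstruction error (involving layer activations) rather than the plain Frobenius-norm objective used here, and all function names are our own.

```python
import numpy as np

def prune_2_4(W):
    """Keep the 2 largest-magnitude entries in each group of 4 along rows (2:4 sparsity)."""
    out = np.zeros_like(W)
    rows, cols = W.shape
    assert cols % 4 == 0, "2:4 sparsity requires the column count to be a multiple of 4"
    for i in range(rows):
        for j in range(0, cols, 4):
            block = W[i, j:j + 4]
            keep = np.argsort(np.abs(block))[-2:]  # indices of the 2 largest magnitudes
            out[i, j + keep] = block[keep]
    return out

def truncated_svd(R, rank):
    """Best rank-`rank` approximation of R in Frobenius norm via SVD."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def sparse_plus_low_rank(W, rank=4, iters=10):
    """Alternating minimization of ||W - S - L||_F with S 2:4-sparse and L rank-`rank`.

    Note: a simplification of the paper's objective, which measures the
    layer-wise reconstruction error on activations rather than on weights.
    """
    L = np.zeros_like(W)
    for _ in range(iters):
        S = prune_2_4(W - L)            # sparse step: prune the low-rank residual
        L = truncated_svd(W - S, rank)  # low-rank step: fit the sparse residual
    return S, L

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
S, L = sparse_plus_low_rank(W, rank=4)
rel_err = np.linalg.norm(W - S - L) / np.linalg.norm(W)
```

Even this naive alternating scheme typically beats either component alone: the low-rank factor absorbs structure that the rigid 2:4 pattern cannot represent. The hardware appeal of the 2:4 pattern is that it maps directly onto GPU sparse tensor cores, while the narrow low-rank factors add only a small dense overhead.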