ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address performance collapse under low-bit quantization, and the incompatibility between quantization and finetuning that hampers edge deployment of large language models (LLMs), this paper proposes ClusComp, a compression-and-finetuning paradigm that combines block-wise weight clustering with codebook-based representation. Model weights are partitioned into blocks; compact codebooks are constructed via clustering; and block-wise differentiable parameter updates with mixed-precision gradient optimization are introduced. Key contributions: (1) superior accuracy, for the first time, over state-of-the-art ultra-low-bit methods (e.g., BitNet, LLM-QAT) at 1-bit quantization; (2) full compression and finetuning of a 70B-parameter LLM on a single A6000 GPU (48 GB VRAM); and (3) significant gains over mainstream weight-only quantization schemes at 2-4 bits. Notably, 2-bit quantization retains over 90% of FP16 full-finetuning accuracy and matches FP16 baselines on multiple benchmarks.
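The clustering-plus-codebook idea can be illustrated with a minimal sketch: partition a weight matrix into small contiguous blocks, run k-means to build a compact codebook, and store only per-block indices plus the codebook. This is an illustrative toy, not the paper's implementation; block size, codebook size, and function names are assumptions.

```python
import numpy as np

def cluster_compress(W, block_size=4, n_codes=16, n_iter=25, seed=0):
    """Toy codebook compression: cluster contiguous blocks of `block_size`
    weights into `n_codes` centroids via plain k-means (illustrative only)."""
    rng = np.random.default_rng(seed)
    blocks = W.reshape(-1, block_size)               # flatten into blocks
    # initialize the codebook with randomly chosen blocks
    codebook = blocks[rng.choice(len(blocks), n_codes, replace=False)].copy()
    for _ in range(n_iter):
        # assign each block to its nearest centroid (squared L2 distance)
        d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # update each centroid as the mean of its assigned blocks
        for k in range(n_codes):
            members = blocks[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook, assign

def reconstruct(codebook, assign, shape):
    """Rebuild an approximate weight matrix from indices + codebook."""
    return codebook[assign].reshape(shape)

W = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
codebook, assign = cluster_compress(W)
W_hat = reconstruct(codebook, assign, W.shape)
# storage becomes 4-bit indices per block plus a tiny codebook,
# instead of one full-precision value per weight
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```

In the paper's paradigm, finetuning then updates the (differentiable) codebook entries block-by-block rather than the full weight matrix, which is what keeps memory low enough for a 70B model on one GPU.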

📝 Abstract
As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation in low-bit weight-only quantization.
Enables efficient finetuning of quantized large language models.
Supports compression and finetuning of 70B LLMs on limited hardware.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters weight matrices into codebooks
Finetunes models block-by-block efficiently
Supports 1-bit compression with minimal finetuning