🤖 AI Summary
Existing codebook quantization methods for large language models (LLMs) struggle to balance storage efficiency and accuracy, particularly at high compression ratios, where quantization-induced precision degradation becomes severe. To address this, the paper proposes low-rank codebook based quantization (LCQ), the first method to incorporate codebooks of rank greater than one into LLM weight quantization. LCQ jointly optimizes the low-rank decomposition and learnable codebooks, recovering substantial accuracy at high compression rates without significant additional storage overhead. Evaluated across multiple standard benchmarks, LCQ achieves an average accuracy improvement of 1.8% over state-of-the-art 4-bit quantization methods such as GPTQ and AWQ, demonstrating both its effectiveness and its practicality for efficient LLM deployment.
📝 Abstract
Large language models (LLMs) have recently demonstrated promising performance on many tasks. However, their high storage and computational costs pose a challenge for deployment. Weight quantization has been widely used for model compression, as it reduces both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method, called low-rank codebook based quantization (LCQ), for LLMs. LCQ adopts a low-rank codebook, the rank of which can be larger than one. Experiments show that LCQ achieves better accuracy than existing methods with negligible extra storage cost.
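The abstract does not spell out the formulation, so the following is only a toy illustration of the core idea, not the paper's actual algorithm: each weight entry stores a low-bit index into a shared codebook, and with a rank-r codebook each row reconstructs its own set of quantization levels from r coefficients. All sizes, the random initialization, and the alternating-minimization scheme below are assumptions for the sketch.

```python
import numpy as np

# Toy sketch of codebook-based weight quantization with a rank-r codebook.
# Sizes and the fitting procedure are illustrative assumptions only.
rng = np.random.default_rng(0)
m, n, bits = 32, 64, 2           # toy weight matrix, 2-bit codes
K = 2 ** bits                    # number of codebook entries
W = rng.standard_normal((m, n))  # stand-in for an LLM weight matrix

def quantize(W, rank, iters=15):
    """Approximate W[i, j] ~= sum_k A[i, k] * C[k, idx[i, j]].

    rank == 1 mimics a rank-one codebook (a per-row scale times shared
    levels); rank > 1 gives each row a richer, row-specific set of K levels.
    """
    A = rng.standard_normal((W.shape[0], rank))  # per-row coefficients
    C = rng.standard_normal((rank, K))           # shared low-rank codebook
    for _ in range(iters):
        # Assignment step: each entry picks its best codeword.
        cand = A @ C                             # (m, K) candidate values per row
        idx = np.abs(W[:, :, None] - cand[:, None, :]).argmin(axis=2)
        # Coefficient step: refit each row's coefficients by least squares.
        for i in range(W.shape[0]):
            X = C[:, idx[i]].T                   # (n, rank) design matrix
            A[i], *_ = np.linalg.lstsq(X, W[i], rcond=None)
    W_hat = np.einsum('ik,kij->ij', A, C[:, idx])
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)  # relative error

err1 = quantize(W, rank=1)
err2 = quantize(W, rank=2)
print(f"relative error  rank-1: {err1:.3f}  rank-2: {err2:.3f}")
```

Note the storage accounting that motivates the approach: a rank-r codebook adds only the codebook itself (r*K floats) and the per-row coefficients (m*r floats) on top of the low-bit index map (m*n codes), which is why extra rank can buy accuracy at near-zero storage cost.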