LCQ: Low-Rank Codebook based Quantization for Large Language Models

📅 2024-05-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most existing weight quantization methods for large language models (LLMs) rely on rank-one codebooks, which suffer severe accuracy degradation at high compression ratios. To address this, we propose low-rank codebook based quantization (LCQ), which adopts a codebook whose rank can be larger than one for LLM weight quantization. LCQ jointly optimizes the low-rank codebook factors, recovering substantial accuracy at high compression rates without incurring significant additional storage overhead. Evaluated across multiple standard benchmarks, LCQ achieves an average accuracy improvement of 1.8% over state-of-the-art 4-bit quantization methods, including GPTQ and AWQ, demonstrating both its effectiveness and practical applicability for efficient LLM deployment.

📝 Abstract
Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, their high storage and computational cost has become a challenge for deployment. Weight quantization has been widely used for model compression, as it can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method, called low-rank codebook based quantization (LCQ), for LLMs. LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with a negligible extra storage cost.
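The abstract's core idea can be illustrated with a minimal sketch. This is not the authors' implementation: the shapes, the random codebook initialization, and the nearest-level assignment below are illustrative assumptions, omitting the joint optimization of the codebook factors that the paper describes. The key mechanism shown is that a rank-r factorization generates the full table of per-group quantization levels at a fraction of the storage cost, and setting r = 1 recovers the conventional scaled-grid scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
G, d, M, r = 8, 64, 16, 2  # weight groups, group size, codebook levels, codebook rank

W = rng.standard_normal((G, d))  # weight matrix, one row per quantization group

# Rank-r codebook: the G x M table of per-group quantization levels is
# factorized into two small matrices, so it costs (G + M) * r floats to
# store instead of G * M. With r = 1 every group uses the same M base
# levels rescaled by a per-group scale, as in conventional quantization.
U = rng.standard_normal((G, r))
V = rng.standard_normal((r, M))
C = U @ V  # C[g] holds the M quantization levels available to group g

# Quantize: map each weight to the index of its nearest level; these
# log2(M)-bit indices are what actually gets stored.
idx = np.argmin(np.abs(W[:, :, None] - C[:, None, :]), axis=2)

# Dequantize: look the levels back up per group.
W_hat = np.take_along_axis(C, idx, axis=1)

print("quantization MSE:", np.mean((W - W_hat) ** 2))
```

In a real method the factors U and V would be learned to minimize reconstruction (or task) error rather than drawn at random; the sketch only demonstrates the storage/lookup mechanics.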
Problem

Research questions and friction points this paper is trying to address.

reduce storage and computational cost
improve weight quantization accuracy
minimize accuracy loss at high compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank codebook based quantization
Reduces LLM storage and computational costs
Improves accuracy with minimal extra storage
Wen-Pu Cai
National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Wu-Jun Li
Nanjing University
Artificial Intelligence · Machine Learning · Big Data