🤖 AI Summary
Existing codebook quantization methods for large language models (LLMs) struggle to balance storage efficiency and accuracy, particularly at high compression ratios, where quantization-induced precision degradation becomes severe. To address this, the paper proposes low-rank codebook based quantization (LCQ), the first method to incorporate codebooks of rank greater than one into LLM weight quantization. LCQ jointly optimizes the low-rank decomposition and learnable codebooks, recovering substantial accuracy at high compression rates without significant additional storage overhead. Evaluated across multiple standard benchmarks, LCQ achieves an average accuracy improvement of 1.8% over state-of-the-art 4-bit quantization methods such as GPTQ and AWQ, demonstrating both its effectiveness and its practicality for efficient LLM deployment.
📝 Abstract
Large language models (LLMs) have recently demonstrated promising performance on many tasks. However, their high storage and computational costs pose a challenge for deployment. Weight quantization has been widely used for model compression, as it reduces both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method, called low-rank codebook based quantization (LCQ), for LLMs. LCQ adopts a low-rank codebook, the rank of which can be larger than one. Experiments show that LCQ achieves better accuracy than existing methods with negligible extra storage cost.
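The abstract does not spell out the formulation, so the following is only a toy illustration of the core idea, not the paper's actual algorithm: each weight entry stores a low-bit index into a shared codebook, and with a rank-r codebook each row reconstructs its own set of quantization levels from r coefficients. All sizes, the random initialization, and the alternating-minimization scheme below are assumptions for the sketch.

```python
import numpy as np

# Toy sketch of codebook-based weight quantization with a rank-r codebook.
# Sizes and the fitting procedure are illustrative assumptions only.
rng = np.random.default_rng(0)
m, n, bits = 32, 64, 2           # toy weight matrix, 2-bit codes
K = 2 ** bits                    # number of codebook entries
W = rng.standard_normal((m, n))  # stand-in for an LLM weight matrix

def quantize(W, rank, iters=15):
    """Approximate W[i, j] ~= sum_k A[i, k] * C[k, idx[i, j]].

    rank == 1 mimics a rank-one codebook (a per-row scale times shared
    levels); rank > 1 gives each row a richer, row-specific set of K levels.
    """
    A = rng.standard_normal((W.shape[0], rank))  # per-row coefficients
    C = rng.standard_normal((rank, K))           # shared low-rank codebook
    for _ in range(iters):
        # Assignment step: each entry picks its best codeword.
        cand = A @ C                             # (m, K) candidate values per row
        idx = np.abs(W[:, :, None] - cand[:, None, :]).argmin(axis=2)
        # Coefficient step: refit each row's coefficients by least squares.
        for i in range(W.shape[0]):
            X = C[:, idx[i]].T                   # (n, rank) design matrix
            A[i], *_ = np.linalg.lstsq(X, W[i], rcond=None)
    W_hat = np.einsum('ik,kij->ij', A, C[:, idx])
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)  # relative error

err1 = quantize(W, rank=1)
err2 = quantize(W, rank=2)
print(f"relative error  rank-1: {err1:.3f}  rank-2: {err2:.3f}")
```

Note the storage accounting that motivates the approach: a rank-r codebook adds only the codebook itself (r*K floats) and the per-row coefficients (m*r floats) on top of the low-bit index map (m*n codes), which is why extra rank can buy accuracy at near-zero storage cost.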