CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
High inference costs and deployment challenges plague large language models (LLMs), especially under ≤3-bit ultra-low-bit quantization, where severe accuracy degradation, poor hardware efficiency, and limited scalability persist. To address these issues, we propose Convolutional Code Quantization (CCQ), the first lookup-table-free linear mapping framework integrating convolutional coding, hybrid codebook clustering, and shift-based decoding, thereby breaking the traditional accuracy–latency trade-off inherent in scalar and vector quantization. CCQ achieves near-lossless accuracy at 2.0–2.75 bits, compressing DeepSeek-V3 and ERNIE-4.5-300B-A47B to 184 GB and 89 GB, respectively, enabling efficient single-GPU deployment. We open-source both the 2-bit quantized models and a lightweight inference engine, significantly enhancing the practicality and hardware compatibility of ultra-low-bit LLMs.

📝 Abstract
The rapid scaling of Large Language Models (LLMs) drives up inference costs and raises substantial deployment barriers. While quantization to 8 or 4 bits mitigates these costs, sub-3-bit methods suffer severe degradation in accuracy, scalability, and efficiency. We propose Convolutional Code Quantization (CCQ), an inference-optimized quantization approach that compresses LLMs to 2.0–2.75 bits with minimal accuracy loss. Departing from error-prone scalar quantization and slow vector quantization, CCQ integrates a hardware-aware bit-shift encoding and decoding solution with Convolutional Code, Hybrid Encoding, and Code Cluster, jointly overcoming accuracy–speed bottlenecks. We construct a lookup-free encoding space that enables a linear mapping between the codebook and weight vectors, thereby optimizing inference performance. Meanwhile, drawing on the data-mapping concept of vector quantization, we minimize model performance degradation under extremely low-bit conditions. Experiments demonstrate that CCQ achieves outstanding performance on LLMs across various benchmarks. We compress DeepSeek-V3 (671B total parameters) to 184 GB and ERNIE-4.5-300B-A47B to 89 GB, enabling single-GPU deployment of ERNIE 4.5 and eliminating inter-card communication. The 2-bit ERNIE-4.5-300B-A47B model and inference engine have been open-sourced.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM inference costs via extreme low-bit quantization
Overcoming accuracy degradation in sub-3-bit quantization methods
Enabling single-GPU deployment for billion-parameter LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convolutional Code Quantization for 2–3-bit LLMs
Hardware-aware bit-shift encoding and decoding
Lookup-free linear mapping for inference optimization
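The core idea behind the last two bullets can be illustrated with a small sketch: codes are unpacked with shifts and masks, then mapped to weights through a linear (affine) function rather than a lookup table. Note this is a minimal illustration, not the actual CCQ scheme; the paper's convolutional code construction is not reproduced here, and `pack_2bit`, `unpack_2bit_linear`, and the per-group `scale`/`bias` are illustrative assumptions.

```python
import numpy as np

def pack_2bit(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3) into uint8 words, four codes per byte."""
    codes = codes.astype(np.uint8).reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6))

def unpack_2bit_linear(packed: np.ndarray, scale: float, bias: float) -> np.ndarray:
    """Shift-based decode: extract each 2-bit code with shifts and masks,
    then map it to a weight via a linear function -- no lookup table."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11  # shape (n_bytes, 4)
    return scale * codes.astype(np.float32).ravel() + bias

codes = np.array([0, 1, 2, 3, 3, 2, 1, 0])
packed = pack_2bit(codes)                  # 8 codes -> 2 bytes
weights = unpack_2bit_linear(packed, scale=0.5, bias=-0.75)
```

Because the code-to-weight mapping is a multiply-add rather than a table fetch, the decode is friendly to GPU hardware, which is the trade-off the "lookup-free" design targets.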
Zhaojing Zhou
Baidu Inc.
Xunchao Li
Baidu Inc.
Minghao Li
Beihang University
Handi Zhang
Baidu Inc.
Haoshuang Wang
Baidu Inc.
Wenbin Chang
Baidu Inc.
Yiqun Liu
Baidu Inc.
Qingqing Dang
Baidu Inc.
Dianhai Yu
Baidu
Yanjun Ma
Baidu Inc.
Haifeng Wang
Baidu Inc.