🤖 AI Summary
To meet the extreme compression requirements of deploying large language models (LLMs) on resource-constrained devices, this paper proposes Channel-Relaxed Vector Quantization (CRVQ), a post-training quantization (PTQ) method that maintains strong performance below 2-bit weight quantization and approaches lossless 1-bit compression. CRVQ combines channel importance estimation and reordering with an extended codebook design that relaxes the constraints on critical channels, enabling flexible bit-accuracy trade-offs. Evaluated on mainstream LLMs, CRVQ improves task performance by an average of 38.9% over the strongest sub-2-bit PTQ baselines. It also supports customizable bit-width quantization and deployment across diverse hardware platforms, substantially reducing computational overhead with minimal loss in model accuracy.
📝 Abstract
Powerful large language models (LLMs) are increasingly expected to be deployed at lower computational cost, bringing their capabilities to resource-constrained devices. Post-training quantization (PTQ) has emerged as a leading approach to this end, with the best methods compressing weights to fewer than 2 bits on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only a few additional bits. This state-of-the-art extreme-compression method rests on two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging extended codebooks to relax the constraints on those critical channels. With our method, we demonstrate a 38.9% improvement over the current strongest sub-2-bit PTQ baseline, bringing 1-bit compression closer to lossless. Furthermore, our approach offers flexible customization of quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.
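To make the two ideas concrete, here is a minimal sketch of channel-relaxed vector quantization under stated assumptions: channel importance is approximated by each output channel's L2 norm (the paper's actual metric may differ), channels are reordered so the critical ones come first, and the critical subset is quantized with a larger "extended" k-means codebook than the rest. All function names, codebook sizes, and the importance proxy are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the CRVQ reference implementation.
import numpy as np


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Plain k-means over weight sub-vectors; returns (codebook, assignments)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sub-vector to its nearest codeword.
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # Update each codeword as the mean of its assigned sub-vectors.
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:
                codebook[j] = members.mean(0)
    return codebook, assign


def quantize_layer(W, dim=4, base_bits=2, extra_bits=4, critical_frac=0.01):
    """Vector-quantize a weight matrix, relaxing the most important channels.

    W: (out_channels, in_features) weight matrix.
    The top `critical_frac` channels (by an L2-norm importance proxy) are
    quantized with a larger codebook (extra_bits); the rest use base_bits.
    """
    importance = np.linalg.norm(W, axis=1)          # assumed importance proxy
    order = np.argsort(-importance)                 # reorder: critical channels first
    n_crit = max(1, int(len(order) * critical_frac))
    crit, rest = order[:n_crit], order[n_crit:]

    W_hat = np.empty_like(W)
    for rows, bits in ((crit, extra_bits), (rest, base_bits)):
        sub = W[rows].reshape(-1, dim)              # split rows into sub-vectors
        codebook, assign = kmeans_codebook(sub, k=2 ** bits)
        W_hat[rows] = codebook[assign].reshape(len(rows), -1)
    return W_hat, order


if __name__ == "__main__":
    W = np.random.randn(256, 64).astype(np.float32)
    W_hat, order = quantize_layer(W)
    print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```

With these toy settings, non-critical channels cost roughly 0.5 bits per weight (4 codewords over 4-dimensional sub-vectors) while the small critical subset gets 1 bit per weight, which is the flavor of bit-accuracy trade-off the paper describes; the real method's codebook construction and bit allocation are more involved.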