🤖 AI Summary
To address the substantial accuracy degradation that post-training quantization (PTQ) causes in large language models (LLMs) at ultra-low bit-widths, this paper proposes Grouped Lattice Vector Quantization (GLVQ). GLVQ learns a group-specific lattice generation matrix for each weight group, yielding non-uniform, adaptive codebooks; it employs Babai rounding as a differentiable approximation to nearest-lattice-point search, enabling end-to-end optimization of the generation matrices and efficient decoding. Unlike conventional uniform quantization, GLVQ preserves low inference latency while markedly improving accuracy at 2–4-bit precision. Extensive experiments show that GLVQ consistently outperforms state-of-the-art PTQ baselines across multiple benchmarks, jointly delivering high accuracy and computational efficiency, which is particularly beneficial for resource-constrained deployment.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy than existing post-training quantization baselines, highlighting its effectiveness for deploying large models under stringent resource constraints. Our source code is available on GitHub: https://github.com/xzhang9308/GLVQ.
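To make the two core operations concrete, here is a minimal NumPy sketch of the quantize/decode step the abstract describes: Babai rounding approximates the nearest lattice point by solving against the generation matrix and rounding to integers, and decoding is a single matrix-vector product. The function names and the example generation matrix are illustrative, not taken from the paper's implementation, and the paper's learned, group-specific matrices and training loop are omitted.

```python
import numpy as np

def babai_round(x, B):
    """Approximate nearest-lattice-point search via Babai rounding.

    Solves B z ~= x in real coordinates, then rounds each coordinate
    to the nearest integer. This is the differentiable-friendly
    approximation used in place of exact (NP-hard) lattice search.
    """
    z = np.linalg.solve(B, x)   # real-valued lattice coordinates
    return np.round(z)          # integer code vector

def decode(z, B):
    """Decoding is just a matrix-vector multiplication: x_hat = B z."""
    return B @ z

# Toy example with a hand-picked (not learned) generation matrix.
B = np.array([[1.0, 0.5],
              [0.0, 1.0]])
x = np.array([1.9, 2.2])        # a group of weights to quantize
z = babai_round(x, B)           # -> integer codes [1., 2.]
x_hat = decode(z, B)            # -> reconstruction [2., 2.]
```

In training, the rounding step would be wrapped with a straight-through-style gradient so the generation matrix `B` of each group can be optimized end-to-end; at inference only `z` and `B` need to be stored, and reconstruction is the single product shown above.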