Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

📅 2025-10-23
🤖 AI Summary
To address the substantial accuracy degradation in post-training quantization (PTQ) of large language models (LLMs) at ultra-low bit-widths, this paper proposes Grouped Lattice Vector Quantization (GLVQ). GLVQ learns group-specific lattice generation matrices to construct non-uniform, adaptive codebooks, and employs Babai rounding as a differentiable approximation to nearest-lattice-point search, enabling end-to-end optimization and efficient decoding. Unlike conventional uniform quantization, GLVQ preserves low inference latency while markedly improving accuracy at 2–4-bit precision. Extensive experiments show that GLVQ consistently outperforms state-of-the-art PTQ methods across multiple benchmarks, jointly delivering high accuracy and computational efficiency, which is particularly valuable for resource-constrained deployment.
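The quantize/decode pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the function name `glvq_quantize` and the toy matrices are hypothetical. Babai round-off computes integer codes z = round(G⁻¹w), and decoding is the single matrix-vector product ŵ = Gz mentioned in the summary:

```python
import numpy as np

def glvq_quantize(weights, gen_matrices):
    """Quantize each weight group with its own lattice codebook.

    weights: (num_groups, d) array of weight groups.
    gen_matrices: (num_groups, d, d) per-group generation matrices.
    Returns integer codes z = round(G^{-1} w) (Babai round-off) and
    reconstructions w_hat = G z (decoding is one matrix-vector multiply).
    """
    codes, recon = [], []
    for w, G in zip(weights, gen_matrices):
        z = np.rint(np.linalg.inv(G) @ w)  # approximate nearest lattice point
        codes.append(z)
        recon.append(G @ z)                # decode: matrix-vector product
    return np.array(codes), np.array(recon)

# Toy example: two 2-D weight groups, each with its own generation matrix.
G = np.stack([0.5 * np.eye(2),
              np.array([[0.5, 0.0], [0.25, 0.5]])])
w = np.array([[0.9, 1.1], [-0.3, 0.4]])
codes, w_hat = glvq_quantize(w, G)
```

Because only the small integer codes are stored per group (plus one generation matrix per group), the format supports the low bit-widths discussed above while keeping decoding cheap.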

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available at: https://github.com/xzhang9308/GLVQ.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM computational demands through low-bit quantization
Overcoming performance degradation in uniform quantization methods
Developing efficient lattice vector quantizers for resource-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses grouped lattice vector quantization for compression
Employs learnable generation matrices per weight group
Applies Babai rounding for differentiable training process
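The "learnable generation matrix" idea in the bullets above can be conveyed with a toy gradient step. This sketch is an assumption-laden simplification, not the paper's method: it freezes the integer codes within each step (a straight-through-style shortcut), whereas the paper propagates gradients through the Babai rounding itself, and it omits any bit-rate constraint:

```python
import numpy as np

def train_step(w, G, lr=0.05):
    """One toy gradient step on the generation matrix G.

    w: (n, d) weight groups; G: (d, d) generation matrix.
    Codes z = round(G^{-1} w) are held fixed (straight-through-style
    simplification), so the MSE gradient w.r.t. G has a closed form.
    """
    z = np.rint(w @ np.linalg.inv(G).T)       # Babai round-off (frozen)
    w_hat = z @ G.T                           # decode: matrix-vector product
    grad = 2.0 * (w_hat - w).T @ z / len(w)   # d/dG of mean squared error
    return G - lr * grad, np.mean((w_hat - w) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 4))                 # 256 weight groups of size 4
G = np.eye(4)                                 # start from the integer lattice
for _ in range(50):
    G, mse = train_step(w, G)
```

Without a rate constraint the lattice simply contracts to shrink reconstruction error; the paper's learned per-group matrices instead adapt the lattice shape at a fixed bit budget.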