VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory overhead of the KV cache and the severe performance degradation that outliers cause in ultra-low-bit vector quantization (VQ) during LLM inference, this paper proposes VecInfer. The method jointly applies smoothing and a Hadamard transformation to suppress outliers in the key cache, improving how well the codebook covers the data distribution, and pairs this with a fused low-bit VQ dequantization-and-computation CUDA kernel that reduces memory accesses and latency. The core innovations are outlier-aware quantization preprocessing and hardware-aware kernel optimization. Evaluated on Llama-3.1-8B, VecInfer achieves near-full-precision generation quality with 2-bit quantization, boosts large-batch self-attention throughput by 2.7×, reduces single-batch end-to-end latency by 8.3×, and supports efficient inference on sequences up to 196K tokens.
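The outlier-suppression step described above can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the per-channel smoothing scale (square root of each channel's max magnitude) and the Sylvester-construction Hadamard matrix are assumptions chosen for clarity. The key property shown is that smoothing plus an orthonormal Hadamard rotation shrinks the extreme values a codebook must cover, while attention scores are preserved when the inverse transform is absorbed into the query side.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

rng = np.random.default_rng(0)
d = 64
keys = rng.normal(size=(1024, d))
keys[:, 3] *= 30.0  # inject a per-channel outlier, as observed in key caches

# Smoothing: divide each channel by a per-channel scale (illustrative choice);
# the scale is absorbed into the queries so attention scores are unchanged.
scale = np.abs(keys).max(axis=0) ** 0.5
smoothed = keys / scale

# Hadamard rotation spreads the remaining outlier energy across all channels.
H = hadamard(d)
rotated = smoothed @ H

# Dynamic range the codebook must cover shrinks substantially.
print(f"max |key| before: {np.abs(keys).max():.1f}, after: {np.abs(rotated).max():.1f}")

# Invariance check: transform the query with the inverse maps and the
# attention logits against the transformed keys match the originals.
q = rng.normal(size=(d,))
q_t = (q * scale) @ H
print(np.allclose(rotated @ q_t, keys @ q))
```

Because the Hadamard matrix is orthonormal, `(a @ H) · (b @ H) = a · b`, so quantizing the rotated keys changes nothing about the exact-arithmetic attention computation; only the quantization error profile improves.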

πŸ“ Abstract
The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache memory overhead in LLM inference
Suppresses key cache outliers for low-bit quantization
Enables efficient deployment with optimized computation kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Suppresses key cache outliers via smoothing and Hadamard transformations
Fuses computation with dequantization in CUDA kernel
Enables 2-bit KV cache quantization with minimal degradation
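The 2-bit vector quantization named in the bullets above can be illustrated with a toy product-quantization-style sketch. This is a stand-in under stated assumptions (a small k-means codebook fit in NumPy; the paper's actual codebook construction and fused CUDA kernel are not reproduced): 16 codes over 2-dim sub-vectors means a 4-bit code id per pair of values, i.e. 2 bits per value stored.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sub, n_codes = 64, 2, 16            # 4-bit code per 2-dim sub-vector = 2 bits/value
keys = rng.normal(size=(2048, d))      # stand-in key cache (post outlier suppression)
blocks = keys.reshape(-1, sub)         # split each row into 2-dim sub-vectors

# Toy k-means codebook fit (stand-in for the paper's codebook training).
codebook = blocks[rng.choice(len(blocks), n_codes, replace=False)].copy()
for _ in range(10):
    assign = ((blocks[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
    for c in range(n_codes):
        mask = assign == c
        if mask.any():
            codebook[c] = blocks[mask].mean(0)

# Quantize: keep only the 4-bit code ids. Dequantize: codebook lookup.
codes = ((blocks[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
dequant = codebook[codes].reshape(keys.shape)
err = np.abs(keys - dequant).mean()
print(f"mean abs reconstruction error: {err:.3f}")
```

In the deployed system, per the summary, this codebook lookup is fused into the attention kernel, so the dequantized keys are never materialized in global memory; only the compact code ids are read, which is where the memory-bandwidth savings come from.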
Authors

Dingyu Yao
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
Zhengyang Tong
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Wei Liu
MiLM Plus, Xiaomi Inc, Beijing, China
Jian Luan
Toshiba, Microsoft, Xiaomi
Weiping Wang
School of Information Science and Engineering, Central South University