XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive KV cache memory overhead of large language models (LLMs) in long-context processing, which hinders deployment in resource-constrained settings, this paper proposes XQuant, a training-free, plug-and-play framework for ultra-low-bit KV cache quantization. The method integrates data-free calibration with cross-layer cache compression, achieving an equivalent 1.38-bit (sub-1.4-bit) KV quantization for the first time and substantially outperforming existing 1.5-bit and 2-bit approaches. The framework enables end-to-end low-bit storage and computation without fine-tuning or auxiliary data. Evaluated on the TruthfulQA and LongBench benchmarks, it surpasses KIVI-2bit and AsymKV-1.5bit even at these ultra-low bit-widths, establishing a new state-of-the-art trade-off between memory compression ratio and model accuracy.
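The summary positions XQuant against KIVI-style 2-bit and AsymKV-style 1.5-bit KV caching. As background, the sketch below shows the per-group asymmetric quantization primitive such methods build on, in NumPy; the group size, bit-width, and function names are illustrative assumptions and this is not XQuant's exact procedure.

```python
import numpy as np

def quantize_kv(x: np.ndarray, n_bits: int = 2, group_size: int = 32):
    """Asymmetric per-group quantization of a KV cache tensor.

    Each group of `group_size` elements gets its own scale and zero-point,
    so outliers in one group do not inflate the error of the others.
    Illustrative sketch of the standard low-bit KV primitive, not XQuant.
    """
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    x_min = g.min(axis=1, keepdims=True)
    x_max = g.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / (2 ** n_bits - 1)
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero
    zero_point = np.round(-x_min / scale)
    q = np.clip(np.round(g / scale) + zero_point, 0, 2 ** n_bits - 1)
    return q.astype(np.uint8), scale, zero_point, orig_shape

def dequantize_kv(q, scale, zero_point, orig_shape):
    """Reconstruct an approximate KV tensor from its quantized form."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(orig_shape)

# Example: 2-bit quantization of a toy key cache (seq_len=64, head_dim=128).
k = np.random.randn(64, 128).astype(np.float32)
q, s, z, shape = quantize_kv(k, n_bits=2)
k_hat = dequantize_kv(q, s, z, shape)
print("mean abs error:", np.abs(k - k_hat).mean())
```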

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.
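The abstract highlights a data-free calibration step with negligible computational cost, but the exact rule is not spelled out here. The following minimal sketch shows one common data-free approach for comparison: deriving a clipping range from the tensor's own moments under an assumed roughly Gaussian distribution. The `alpha` factor and function name are hypothetical, not the paper's calibration method.

```python
import numpy as np

def data_free_clip_range(x: np.ndarray, n_bits: int = 2, alpha: float = 2.5):
    """Pick a clipping range without any calibration data.

    Instead of sweeping a held-out dataset, the range is taken as
    mean +/- alpha * std of the tensor itself, assuming a roughly
    Gaussian value distribution. Illustrative only; XQuant's actual
    calibration rule is not reproduced here.
    """
    mu, sigma = float(x.mean()), float(x.std())
    lo, hi = mu - alpha * sigma, mu + alpha * sigma
    scale = (hi - lo) / (2 ** n_bits - 1)
    return lo, hi, scale

# Example: derive a 2-bit clipping range for a value-cache slice on the fly.
v = np.random.randn(64, 128).astype(np.float32)
lo, hi, scale = data_free_clip_range(v, n_bits=2)
v_q = np.clip(np.round((np.clip(v, lo, hi) - lo) / scale), 0, 3)
print(lo, hi, scale, int(v_q.max()))
```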
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache memory usage in LLMs
Achieving ultra-low bit quantization without training
Maintaining accuracy while compressing the KV cache across layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free plug-and-play KV cache quantization framework
Data-free calibration method with negligible computational cost
Cross-layer compression enabling sub-1.4-bit quantization (see the sketch after this list)
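The paper does not detail its cross-layer scheme in this summary, so the sketch below is only one plausible reading of how an equivalent bit-width can fall below what any single layer stores: a group of adjacent layers shares one low-bit cache while each layer keeps a small per-layer correction, and the cost is amortized over the group. The grouping factor and residual width are assumptions chosen purely to show how a figure like 1.38 bits could arise, not XQuant's actual breakdown.

```python
def equivalent_bits(shared_bits: float, residual_bits: float,
                    layers_per_group: int, overhead_bits: float = 0.0) -> float:
    """Amortized bits per cached element when `layers_per_group` layers share
    one `shared_bits` cache and each layer adds a `residual_bits` correction.

    Purely illustrative arithmetic; the numbers below are not reported by
    the paper.
    """
    total = shared_bits + layers_per_group * residual_bits + overhead_bits
    return total / layers_per_group

# If two adjacent layers share one 2-bit cache and each keeps a ~0.38-bit
# correction (e.g. a sparse sign/residual code), the amortized cost is
# (2 + 2 * 0.38) / 2 = 1.38 bits per element.
print(equivalent_bits(shared_bits=2.0, residual_bits=0.38, layers_per_group=2))
```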
👥 Authors
Haoqi Yang (School of Computer Science, Wuhan University, Wuhan, China)
Yao Yao (School of Computer Science, Shanghai Jiao Tong University, Shanghai, China)
Zuchao Li (Wuhan University)
Baoyuan Qi (Xiaomi Inc., Beijing, China)
Guoming Liu (Xiaomi Inc., Beijing, China)
Hai Zhao (School of Computer Science, Shanghai Jiao Tong University, Shanghai, China)

Topics: Natural Language Processing, Machine Learning