🤖 AI Summary
To address the excessive KV cache memory overhead of large language models (LLMs) in long-context processing—a major obstacle to deployment in resource-constrained settings—this paper proposes a training-free, plug-and-play ultra-low-bit KV cache quantization framework. The method integrates data-free calibration with cross-layer cache compression, achieving, for the first time, an equivalent bit-width of 1.38 bits (sub-1.4 bit) for the KV cache, substantially below existing 1.5-bit and 2-bit approaches. The framework enables end-to-end low-bit storage and computation without fine-tuning or auxiliary data. Evaluated on the TruthfulQA and LongBench benchmarks, it surpasses KIVI-2bit and AsymKV-1.5bit even at this lower bit-width, establishing a new state-of-the-art trade-off between memory compression ratio and model accuracy.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, driven in particular by KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a data-free calibration method with negligible computational overhead, and cross-layer KV cache compression, together enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit), achieving a lower bit-width while maintaining superior performance and thus a better trade-off between memory efficiency and model accuracy.
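To make the two ingredients concrete, the sketch below illustrates the general ideas only: per-group asymmetric quantization of a KV tensor (the standard building block behind 2-bit schemes such as KIVI), and an "equivalent bit-width" accounting in which per-group metadata is counted and storage is amortized across layers that share one cache (the intuition behind cross-layer compression). The function names, group size, and fp16 metadata assumption are ours for illustration; they are not XQuant's actual API or its exact bit accounting.

```python
import numpy as np

def quantize_asym(x, n_bits=2, group_size=32):
    """Per-group asymmetric quantization (generic sketch, not XQuant's scheme).

    Each group of `group_size` values is mapped to integers in
    [0, 2**n_bits - 1] using a per-group scale and offset."""
    qmax = 2 ** n_bits - 1
    x = x.reshape(-1, group_size)
    lo = x.min(axis=1, keepdims=True)           # per-group offset (zero-point)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax                    # per-group scale
    scale[scale == 0] = 1.0                     # guard against constant groups
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate values from codes plus per-group metadata."""
    return q * scale + lo

def equivalent_bits(n_bits, group_size, shared_layers=1):
    """Bits per original element: payload plus fp16 scale/offset metadata,
    amortized over `shared_layers` layers sharing one cache (illustrative)."""
    meta = 2 * 16 / group_size                  # fp16 scale + offset per group
    return (n_bits + meta) / shared_layers

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 32)).astype(np.float32)   # toy KV slice
q, s, z = quantize_asym(kv, n_bits=2)
recon = dequantize(q, s, z).reshape(kv.shape)
err = np.abs(kv - recon).max()                 # bounded by half the largest scale
eq = equivalent_bits(2, 32, shared_layers=2)   # 2-bit cache shared by 2 layers
```

With 2-bit codes, a group size of 32, and two layers sharing one cache, `equivalent_bits` gives 1.5 bits per element, showing how cross-layer sharing pushes the effective bit-width below the nominal payload; reaching 1.38 bits as reported would require the paper's specific compression and accounting choices.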