🤖 AI Summary
To address the excessive KV cache memory overhead of large language models (LLMs) in long-context processing—a major obstacle to deployment in resource-constrained settings—this paper proposes a training-free, plug-and-play ultra-low-bit KV cache quantization framework. The method integrates data-free calibration with cross-layer cache compression, achieving, for the first time, an equivalent bit-width of 1.38 bits (sub-1.4 bit) for the KV cache, substantially below existing 1.5-bit and 2-bit approaches. The framework enables end-to-end low-bit storage and computation without fine-tuning or auxiliary data. Evaluated on the TruthfulQA and LongBench benchmarks, it surpasses KIVI-2bit and AsymKV-1.5bit even at this lower bit-width, establishing a new state-of-the-art trade-off between memory compression ratio and model accuracy.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, driven in particular by KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a data-free calibration method with negligible computational overhead, and cross-layer KV cache compression, together enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit), achieving a lower bit-width while maintaining superior performance and thus a better trade-off between memory efficiency and model accuracy.
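To make the two ingredients concrete, the sketch below illustrates the general ideas only: per-group asymmetric quantization of a KV tensor (the standard building block behind 2-bit schemes such as KIVI), and an "equivalent bit-width" accounting in which per-group metadata is counted and storage is amortized across layers that share one cache (the intuition behind cross-layer compression). The function names, group size, and fp16 metadata assumption are ours for illustration; they are not XQuant's actual API or its exact bit accounting.

```python
import numpy as np

def quantize_asym(x, n_bits=2, group_size=32):
    """Per-group asymmetric quantization (generic sketch, not XQuant's scheme).

    Each group of `group_size` values is mapped to integers in
    [0, 2**n_bits - 1] using a per-group scale and offset."""
    qmax = 2 ** n_bits - 1
    x = x.reshape(-1, group_size)
    lo = x.min(axis=1, keepdims=True)           # per-group offset (zero-point)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax                    # per-group scale
    scale[scale == 0] = 1.0                     # guard against constant groups
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate values from codes plus per-group metadata."""
    return q * scale + lo

def equivalent_bits(n_bits, group_size, shared_layers=1):
    """Bits per original element: payload plus fp16 scale/offset metadata,
    amortized over `shared_layers` layers sharing one cache (illustrative)."""
    meta = 2 * 16 / group_size                  # fp16 scale + offset per group
    return (n_bits + meta) / shared_layers

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 32)).astype(np.float32)   # toy KV slice
q, s, z = quantize_asym(kv, n_bits=2)
recon = dequantize(q, s, z).reshape(kv.shape)
err = np.abs(kv - recon).max()                 # bounded by half the largest scale
eq = equivalent_bits(2, 32, shared_layers=2)   # 2-bit cache shared by 2 layers
```

With 2-bit codes, a group size of 32, and two layers sharing one cache, `equivalent_bits` gives 1.5 bits per element, showing how cross-layer sharing pushes the effective bit-width below the nominal payload; reaching 1.38 bits as reported would require the paper's specific compression and accounting choices.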