Accurate KV Cache Quantization with Outlier Tokens Tracing

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV cache quantization reduces the memory overhead of large language model (LLM) inference, but outlier tokens—tokens whose statistics deviate sharply from the typical distribution—severely degrade low-bit quantization accuracy. This work systematically identifies and characterizes such anomalous tokens in KV caches and proposes a dynamic outlier-aware quantization framework: (1) a lightweight, real-time mechanism that detects and masks outlier tokens online during decoding, excluding them from quantization; and (2) a joint quantization strategy—channel-wise for Keys and token-wise for Values—that enables aggressive 2-bit compression. Experiments show that the method achieves a 6.4× memory reduction over FP16 while substantially improving quantization fidelity and raising throughput by 2.3×, overcoming the limitations of conventional uniform quantization.

📝 Abstract
The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
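The channel-wise (Keys) versus token-wise (Values) quantization described above can be sketched in NumPy. This is a minimal illustration of asymmetric 2-bit min-max quantization, not the paper's actual implementation; the function names and the min-max scheme are assumptions for illustration.

```python
import numpy as np

def quantize_2bit(x, axis):
    """Asymmetric 2-bit min-max quantization along `axis`.

    Statistics are pooled along `axis`, so axis=0 gives one (scale,
    zero-point) per channel and axis=1 gives one per token. 2 bits
    yield 4 integer levels (0..3)."""
    mn = x.min(axis=axis, keepdims=True)
    mx = x.max(axis=axis, keepdims=True)
    scale = (mx - mn) / 3.0
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    codes = np.clip(np.round((x - mn) / scale), 0, 3)
    return codes, scale, mn

def dequantize(codes, scale, mn):
    """Map integer codes back to approximate floating-point values."""
    return codes * scale + mn

# Toy KV cache of shape (num_tokens, num_channels).
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))
values = rng.normal(size=(8, 16))

# Channel-wise for Keys (statistics over tokens, axis=0),
# token-wise for Values (statistics over channels, axis=1).
k_codes, k_scale, k_min = quantize_2bit(keys, axis=0)
v_codes, v_scale, v_min = quantize_2bit(values, axis=1)

k_err = np.abs(dequantize(k_codes, k_scale, k_min) - keys).mean()
v_err = np.abs(dequantize(v_codes, v_scale, v_min) - values).mean()
```

Matching the grouping axis to the observed distribution (Keys vary by channel, Values by token) keeps each quantization group's dynamic range small, which is what makes 2-bit precision viable at all.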
Problem

Research questions and friction points this paper is trying to address.

Identify outlier tokens affecting KV Cache quantization accuracy
Reduce memory overhead while maintaining LLM inference accuracy
Improve quantization performance for Keys and Values separately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identify outlier tokens during decoding
Exclude outlier tokens from quantization
Achieve high accuracy with 2-bit quantization
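One way the outlier-identification step above could look is a running statistical check against the tokens already cached. The norm-deviation criterion below is an illustrative assumption; the paper's summary does not specify the exact detection rule.

```python
import numpy as np

def is_outlier_token(new_vec, cached, k=3.0):
    """Flag a token whose vector norm deviates more than k standard
    deviations from the norms of previously cached tokens.

    Flagged tokens would be kept in full precision (e.g. FP16) and
    excluded from 2-bit quantization."""
    norms = np.linalg.norm(cached, axis=1)
    mu, sigma = norms.mean(), norms.std()
    if sigma == 0:
        return False  # no spread yet: nothing can be called an outlier
    return bool(abs(np.linalg.norm(new_vec) - mu) > k * sigma)

# Toy decoding step: cached Keys with norms 1..10 (mu=5.5, sigma~2.87).
cached = np.diag(np.arange(1.0, 11.0))
typical = is_outlier_token(np.full(10, 0.5), cached)   # norm ~1.58
extreme = is_outlier_token(np.full(10, 30.0), cached)  # norm ~94.9
```

Because the check runs per decoded token against cheap running statistics, it adds negligible overhead to the decoding loop while letting the rare anomalous tokens bypass quantization entirely.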
Yi Su
School of Computer Science and Technology, Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University
Yuechi Zhou
School of Computer Science and Technology, Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University
Quantong Qiu
Soochow University
LLM · Sparse Attention · KV Cache
Juntao Li
Soochow University
Language Models · Text Generation
Qingrong Xia
Soochow University
NLP
Ping Li
Huawei Cloud
Xinyu Duan
Huawei Cloud
LLM · Inference Optimization
Zhefeng Wang
Huawei Cloud
NLP · AI System · LLM · Multi-modality · Machine Learning
Min Zhang
School of Computer Science and Technology, Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University