Accurate KV Cache Quantization with Outlier Tokens Tracing

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV cache quantization reduces the memory overhead of large language model (LLM) inference, but outlier tokens—tokens whose statistics deviate sharply from the typical distribution—severely degrade low-bit quantization accuracy. This work systematically identifies and characterizes such anomalous tokens in KV caches and proposes a dynamic outlier-aware quantization framework: (1) a lightweight, real-time mechanism that detects and masks outlier tokens online during decoding, excluding them from quantization; and (2) a joint quantization strategy—channel-wise for Keys and token-wise for Values—that enables aggressive 2-bit compression. Experiments show that the method achieves a 6.4× memory reduction over FP16 while substantially improving quantization fidelity and raising throughput by 2.3×, overcoming the limitations of conventional uniform quantization.

📝 Abstract
The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
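The channel-wise (Keys) versus token-wise (Values) quantization described above can be sketched in NumPy. This is a minimal illustration of asymmetric 2-bit min-max quantization, not the paper's actual implementation; the function names and the min-max scheme are assumptions for illustration.

```python
import numpy as np

def quantize_2bit(x, axis):
    """Asymmetric 2-bit min-max quantization along `axis`.

    Statistics are pooled along `axis`, so axis=0 gives one (scale,
    zero-point) per channel and axis=1 gives one per token. 2 bits
    yield 4 integer levels (0..3)."""
    mn = x.min(axis=axis, keepdims=True)
    mx = x.max(axis=axis, keepdims=True)
    scale = (mx - mn) / 3.0
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    codes = np.clip(np.round((x - mn) / scale), 0, 3)
    return codes, scale, mn

def dequantize(codes, scale, mn):
    """Map integer codes back to approximate floating-point values."""
    return codes * scale + mn

# Toy KV cache of shape (num_tokens, num_channels).
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))
values = rng.normal(size=(8, 16))

# Channel-wise for Keys (statistics over tokens, axis=0),
# token-wise for Values (statistics over channels, axis=1).
k_codes, k_scale, k_min = quantize_2bit(keys, axis=0)
v_codes, v_scale, v_min = quantize_2bit(values, axis=1)

k_err = np.abs(dequantize(k_codes, k_scale, k_min) - keys).mean()
v_err = np.abs(dequantize(v_codes, v_scale, v_min) - values).mean()
```

Matching the grouping axis to the observed distribution (Keys vary by channel, Values by token) keeps each quantization group's dynamic range small, which is what makes 2-bit precision viable at all.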
Problem

Research questions and friction points this paper is trying to address.

Identify outlier tokens affecting KV Cache quantization accuracy
Reduce memory overhead while maintaining LLM inference accuracy
Improve quantization performance for Keys and Values separately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identify outlier tokens during decoding
Exclude outlier tokens from quantization
Achieve high accuracy with 2-bit quantization
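One way the outlier-identification step above could look is a running statistical check against the tokens already cached. The norm-deviation criterion below is an illustrative assumption; the paper's summary does not specify the exact detection rule.

```python
import numpy as np

def is_outlier_token(new_vec, cached, k=3.0):
    """Flag a token whose vector norm deviates more than k standard
    deviations from the norms of previously cached tokens.

    Flagged tokens would be kept in full precision (e.g. FP16) and
    excluded from 2-bit quantization."""
    norms = np.linalg.norm(cached, axis=1)
    mu, sigma = norms.mean(), norms.std()
    if sigma == 0:
        return False  # no spread yet: nothing can be called an outlier
    return bool(abs(np.linalg.norm(new_vec) - mu) > k * sigma)

# Toy decoding step: cached Keys with norms 1..10 (mu=5.5, sigma~2.87).
cached = np.diag(np.arange(1.0, 11.0))
typical = is_outlier_token(np.full(10, 0.5), cached)   # norm ~1.58
extreme = is_outlier_token(np.full(10, 30.0), cached)  # norm ~94.9
```

Because the check runs per decoded token against cheap running statistics, it adds negligible overhead to the decoding loop while letting the rare anomalous tokens bypass quantization entirely.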
Yi Su
School of Computer Science and Technology, Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University
Yuechi Zhou
School of Computer Science and Technology, Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University
Quantong Qiu
Soochow University
LLM · Sparse Attention · KV Cache
Juntao Li
Soochow University
Language Models · Text Generation
Qingrong Xia
Soochow University
NLP
Ping Li
Huawei Cloud
Xinyu Duan
Huawei Cloud
LLM · Inference Optimization
Zhefeng Wang
Huawei Cloud
NLP · AI System · LLM · Multi-modality · Machine Learning
Min Zhang
School of Computer Science and Technology, Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University