🤖 AI Summary
Large language models (LLMs) suffer from the quadratic complexity of attention, and their KV cache grows with context length, inflating both memory consumption and latency when processing long contexts (≥32K tokens). To address this, we propose the first comprehensive classification framework for KV cache compression, unifying key techniques (quantization, pruning, low-rank approximation, sequence grouping, and locality-aware reuse) along both theoretical and implementation dimensions. Through standardized cross-method evaluation, we characterize the fundamental trade-offs among accuracy, inference speed, and memory footprint. Experimental results show that, with perplexity degradation under 1%, the best compression strategies achieve 2–5× KV cache memory reduction and 1.3–2.1× inference speedup, substantially improving the feasibility of deploying LLMs in long-context scenarios.
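Of the techniques listed above, quantization is the simplest to illustrate concretely. The sketch below (an illustration of the general idea, not any specific paper's method; the function names `quantize_kv`/`dequantize_kv` are hypothetical) compresses a float32 KV cache tensor to int8 using per-channel max-abs scales, giving roughly 4× memory reduction at a small reconstruction error:

```python
import numpy as np

def quantize_kv(cache: np.ndarray):
    """Quantize a float32 KV cache tensor to int8 with per-channel scales.

    cache: (num_tokens, head_dim) float32 array of cached keys or values.
    Returns (q, scale): int8 codes and a per-channel float32 scale vector.
    """
    # Per-channel max-abs scaling maps each column into [-127, 127].
    scale = np.abs(cache).max(axis=0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(cache / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 cache from int8 codes and scales."""
    return q.astype(np.float32) * scale

# Toy cache: 1024 tokens, head dimension 128 (illustrative sizes).
rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 128)).astype(np.float32)

q, s = quantize_kv(kv)
kv_hat = dequantize_kv(q, s)

# int8 storage is ~4x smaller than float32; the scale vector adds
# only one row of float32 overhead.
ratio = kv.nbytes / (q.nbytes + s.nbytes)
err = np.abs(kv - kv_hat).max()
```

Real systems typically apply finer-grained (per-token or per-group) scales and handle outlier channels separately, which is where the accuracy/compression trade-offs characterized in the paper arise.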
📝 Abstract
Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, the computational cost of attention grows quadratically with context length, presenting significant efficiency challenges. This paper analyzes Key-Value (KV) cache compression strategies, offering a comprehensive taxonomy that categorizes these methods by their underlying principles and implementation techniques. Furthermore, we evaluate their impact on model quality and inference latency, providing critical insights into their effectiveness. Our findings highlight the trade-offs involved in KV cache compression and its influence on handling long-context scenarios, paving the way for more efficient LLM implementations.