UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
To address the high memory overhead of KV caches and the lack of prefill acceleration in long-context inference for large language models, this paper proposes an uncertainty-aware compression method based on matrix entropy. It is the first to uncover and exploit structured sparsity between hidden states and their corresponding KV caches, designing an adaptive grouping compression mechanism at the layer and head level that jointly compresses both components. The method is fully compatible with Grouped-Query Attention and requires no fine-tuning. Key contributions include: (1) the first KV cache compression scheme that also accelerates the prefilling stage; (2) post-compression performance that surpasses the full-size KV cache on needle-in-a-haystack tasks, even at 9.38% of the original cache size. Experiments show the KV cache reduced to 4.74% of its original size, a 1.6× prefill speedup, a 6.4× throughput improvement, a 1.4× end-to-end inference speedup, and only a 1.41% performance loss.

📝 Abstract
Deploying large language models (LLMs) is challenging due to their high memory and computational demands, especially during long-context inference. While key-value (KV) caching accelerates inference by reusing previously computed keys and values, it also introduces significant memory overhead. Existing KV cache compression methods such as eviction and merging typically compress the KV cache after it is generated and overlook the eviction of hidden states, failing to improve the speed of the prefilling stage. Additionally, applying a uniform compression rate across different attention heads can harm crucial retrieval heads in needle-in-a-haystack tasks due to excessive compression. In this paper, we propose UNComp, an uncertainty-aware compression scheme that leverages matrix entropy to estimate model uncertainty across layers and heads at the token sequence level. By grouping layers and heads based on their uncertainty, UNComp adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size, resulting in a 6.4x increase in throughput and a 1.4x speedup in inference with only a 1.41% performance loss. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size. Our approach offers an efficient, training-free Grouped-Query Attention paradigm that can be seamlessly integrated into existing KV cache schemes.
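The abstract's core signal is matrix entropy computed over token sequences. The paper does not spell out the exact formula here, but a standard reading is the von Neumann entropy of the trace-normalized Gram matrix of a head's token representations: low entropy means the tokens span a low-rank, redundant subspace (safe to compress aggressively), while high entropy indicates diverse, information-rich tokens. A minimal sketch under that assumption:

```python
import numpy as np

def matrix_entropy(X: np.ndarray) -> float:
    """Von Neumann entropy of the trace-normalized Gram matrix of X
    (n_tokens x d). Assumed form of the paper's uncertainty measure:
    low entropy = redundant token representations."""
    X = X - X.mean(axis=0, keepdims=True)   # center the token vectors
    K = X @ X.T                             # n_tokens x n_tokens Gram matrix
    K /= np.trace(K)                        # eigenvalues now sum to 1
    eig = np.linalg.eigvalsh(K)
    eig = eig[eig > 1e-12]                  # discard numerical zeros
    return float(-(eig * np.log(eig)).sum())

rng = np.random.default_rng(0)
redundant = np.outer(rng.normal(size=64), rng.normal(size=128))  # rank-1 tokens
diverse = rng.normal(size=(64, 128))                             # full-rank tokens
print(matrix_entropy(redundant) < matrix_entropy(diverse))       # True
```

A rank-1 sequence collapses to a single nonzero eigenvalue (entropy near zero), while i.i.d. noise approaches the maximum log(n_tokens), which is why entropy can rank heads by how compressible their caches are.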
Problem

Research questions and friction points this paper is trying to address.

Reducing memory demands of LLMs during long-context inference
Identifying sparsity patterns in KV cache using uncertainty measures
Developing adaptive compression that preserves model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses matrix entropy to identify sparsity patterns
Dynamically adjusts compression based on uncertainty
Reduces KV cache size by exploiting information content
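The "dynamically adjusts compression" idea can be illustrated as entropy-driven budget allocation: sort heads by their matrix entropy, group them, and give high-uncertainty groups a larger share of the KV-cache token budget. The grouping rule and weighting below are illustrative stand-ins, not the paper's exact scheme:

```python
import numpy as np

def allocate_budgets(entropies, total_budget: int, n_groups: int = 3) -> np.ndarray:
    """Illustrative allocation (not the paper's exact rule): split heads
    into n_groups by ascending entropy and weight each group's share of
    the KV-cache token budget linearly, so uncertain heads keep more."""
    entropies = np.asarray(entropies, dtype=float)
    order = np.argsort(entropies)                 # low -> high uncertainty
    groups = np.array_split(order, n_groups)
    weights = np.arange(1, n_groups + 1, dtype=float)
    weights /= weights.sum()                      # shares sum to 1
    budgets = np.empty(len(entropies), dtype=int)
    for k, g in enumerate(groups):
        budgets[g] = int(total_budget * weights[k] / len(g))
    return budgets

ent = [0.1, 2.0, 1.0, 3.0, 0.5, 2.5]   # hypothetical per-head entropies
print(allocate_budgets(ent, total_budget=600))
```

With these hypothetical entropies the highest-entropy head (index 3) receives three times the tokens of the lowest-entropy head (index 0), which mirrors the paper's motivation: a uniform rate would over-compress crucial retrieval heads in needle-in-a-haystack tasks.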