In-context KV-Cache Eviction for LLMs via Attention-Gate

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
📄 PDF

career value

223K/year
🤖 AI Summary
To address memory and computational bottlenecks induced by KV-Cache in large language model (LLM) inference, this work proposes a dynamic, context-aware KV-cache eviction mechanism. The core method introduces a lightweight Attention-Gate module that enables layer- and head-granular adaptive eviction decisions, facilitating non-intrusive fine-tuning of pretrained models without architectural modification. Our approach integrates hierarchical attention masking, continual pretraining, and supervised fine-tuning to achieve fine-grained, dynamic pruning of KV caches. Experiments across multiple benchmarks demonstrate substantial improvements: up to 42% reduction in GPU memory consumption, up to 31% decrease in end-to-end latency, and consistent gains in generation quality (e.g., +1.8 BLEU, +0.9 ROUGE-L). To the best of our knowledge, this is the first method to jointly enhance both inference efficiency and generation performance through KV-cache optimization.

Technology Category

Application Category

📝 Abstract
The KV-Cache technique has become the standard for the inference of large language models (LLMs). Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate to the model. It accepts the global context as input and yields eviction flags for each token. The self-attention modules in the model proceed according to the flags and cache only a subset of the KV states for next token prediction. The Attention-Gates can yield various flags for different heads and layers and be easily tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by Attention-Gates can be minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens can not only improve efficiency but also enhance performance.
Problem

Research questions and friction points this paper is trying to address.

Dynamic KV-Cache eviction for efficient LLM inference
Lightweight Attention-Gate module for token-level cache control
Reducing computational and memory overhead in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic KV-Cache eviction via Attention-Gate
Lightweight module for token eviction flags
Minimal overhead with continual pre-training