In-context KV-Cache Eviction for LLMs via Attention-Gate

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 1
🤖 AI Summary
To address the memory and computational bottlenecks induced by the KV-Cache in large language model (LLM) inference, this work proposes a dynamic, context-aware KV-cache eviction mechanism. The core method introduces a lightweight Attention-Gate module that makes adaptive eviction decisions at layer and head granularity, and can be tuned on top of pretrained models without architectural modification. The approach combines hierarchical attention masking, continual pretraining, and supervised fine-tuning to achieve fine-grained, dynamic pruning of KV caches. Experiments across multiple benchmarks demonstrate substantial improvements: up to 42% reduction in GPU memory consumption, up to 31% decrease in end-to-end latency, and consistent gains in generation quality (e.g., +1.8 BLEU, +0.9 ROUGE-L). To the best of the authors' knowledge, this is the first method to jointly enhance both inference efficiency and generation performance through KV-cache optimization.

๐Ÿ“ Abstract
The KV-Cache technique has become the standard for the inference of large language models (LLMs). Yet, it is widely recognized that the KV-Cache can become a bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model. It accepts the global context as input and yields an eviction flag for each token. The self-attention modules in the model proceed according to the flags and cache only a subset of the KV states for next-token prediction. The Attention-Gates can yield different flags for different heads and layers, and can be easily tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by Attention-Gates is minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens can not only improve efficiency but also enhance performance.
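The abstract describes a gate that maps the global context to per-token, per-head eviction flags. A minimal NumPy sketch of that idea follows; the function name, linear parameterization (`W`, `b`), and sigmoid threshold are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def attention_gate(hidden, W, b, threshold=0.5):
    """Hypothetical Attention-Gate sketch.

    hidden: (seq_len, d_model) token representations (the global context).
    W: (d_model, num_heads), b: (num_heads,) gate parameters mapping each
    token to a keep-probability per attention head.
    Returns a boolean mask of shape (num_heads, seq_len); True = keep the
    token's KV state in the cache, False = evict it.
    """
    logits = hidden @ W + b                  # (seq_len, num_heads)
    keep_prob = 1.0 / (1.0 + np.exp(-logits))  # sigmoid keep-probability
    return (keep_prob > threshold).T         # (num_heads, seq_len)
```

Because the mask is produced per head (and a separate gate can sit at each layer), different heads can retain different subsets of tokens, which is the layer- and head-granular behavior the paper emphasizes.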
Problem

Research questions and friction points this paper is trying to address.

Dynamic KV-Cache eviction for efficient LLM inference
Lightweight Attention-Gate module for token-level cache control
Reducing computational and memory overhead in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic KV-Cache eviction via Attention-Gate
Lightweight module for token eviction flags
Minimal overhead with continual pre-training
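Once the flags exist, eviction amounts to attending only over the retained KV entries for each head. A sketch of this step, under the assumption that the gate's boolean mask simply filters the cached keys and values (the helper name is hypothetical):

```python
import numpy as np

def evict_and_attend(q, K, V, keep_mask):
    """Single-head attention over only the non-evicted KV states.

    q: (d,) query for the next token.
    K, V: (seq_len, d) cached key and value states.
    keep_mask: (seq_len,) boolean flags from the gate; True = kept.
    """
    K_kept, V_kept = K[keep_mask], V[keep_mask]    # evicted entries dropped
    scores = K_kept @ q / np.sqrt(q.shape[0])      # scaled dot-product
    weights = np.exp(scores - scores.max())        # stable softmax
    weights /= weights.sum()
    return weights @ V_kept
```

Memory and compute both shrink with the number of evicted tokens, since the evicted KV rows are never stored or scored again.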
Zihao Zeng
Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University
Bokai Lin
Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University
Tianqi Hou
Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.
statistical physics, machine learning, high-dimensional statistics, computational neuroscience
Hao Zhang
University of California, San Diego
Zhijie Deng
Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University