Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference

📅 2025-02-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the low retrieval efficiency and high memory overhead of the KV cache in long-context LLM inference, this paper proposes ActQKV, a training-free, activation-aware KV compression method. First, it introduces an Activation Bias metric to dynamically construct, during prefilling, probe queries that capture global contextual semantics. Second, it models inter-layer token activation density to adaptively prune KV entries within a sliding window, enabling layer-aware dynamic truncation. Crucially, ActQKV is the first approach to jointly exploit sparse attention patterns and token-level activation states, balancing retrieval accuracy against computational efficiency. Evaluated on Long-Bench and ∞Bench, ActQKV achieves state-of-the-art performance, significantly reducing GPU memory consumption and decoding latency while preserving generation quality comparable to full KV caching.
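As a rough illustration of the mechanism summarized above, here is a minimal, dependency-free sketch of activation-aware probe-query construction and KV retrieval. The Activation Bias formula (deviation of per-token activation magnitude from the window mean), the top-k pooling, and the dot-product scoring are simplifying assumptions for illustration, not the paper's exact definitions:

```python
import math

def activation_bias(window_q):
    """Assumed form of Activation Bias: how far each token's query-vector
    magnitude deviates from the window's mean magnitude."""
    norms = [math.sqrt(sum(x * x for x in q)) for q in window_q]
    mean = sum(norms) / len(norms)
    return [abs(n - mean) for n in norms]

def build_probe_query(window_q, top_k=2):
    """Pool the top-k most biased (i.e. most representative) tokens of one
    sliding window into a single probe query vector."""
    bias = activation_bias(window_q)
    reps = sorted(range(len(window_q)), key=lambda i: bias[i], reverse=True)[:top_k]
    dim = len(window_q[0])
    return [sum(window_q[i][d] for i in reps) / top_k for d in range(dim)]

def retrieve_kv(probe_q, keys, budget):
    """Score cached keys against the probe query (dot product) and keep the
    indices of the `budget` most relevant KV pairs, in original order."""
    scores = [sum(a * b for a, b in zip(probe_q, k)) for k in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)
```

In the real method these operations would run over per-head query/key tensors at the pre-filling stage; the sketch only shows the selection logic.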

πŸ“ Abstract
Recent advances in large language models (LLMs) have showcased exceptional performance in long-context tasks, while facing significant inference efficiency challenges with limited GPU memory. Existing solutions first proposed the sliding-window approach to accumulate a set of historical key-value (KV) pairs for reuse, then further improvements selectively retain its subsets at each step. However, due to the sparse attention distribution across a long context, it is hard to identify and recall relevant KV pairs, as the attention is distracted by massive candidate pairs. Additionally, we found it promising to select representative tokens as probe-Query in each sliding window to effectively represent the entire context, an approach overlooked by existing methods. Thus, we propose ActQKV, a training-free, Activation-aware approach that dynamically determines the probe-Query and leverages it to retrieve the relevant KV pairs for inference. Specifically, ActQKV monitors a token-level indicator, Activation Bias, within each context window, enabling the proper construction of the probe-Query for retrieval at the pre-filling stage. To accurately recall the relevant KV pairs and minimize the irrelevant ones, we design a dynamic KV cut-off mechanism guided by information density across layers at the decoding stage. Experiments on the Long-Bench and ∞Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
Problem

Research questions and friction points this paper is trying to address.

Enhances key-value retrieval efficiency in LLMs.
Reduces GPU memory usage during long-context inference.
Improves attention focus by selecting representative tokens.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation-aware probe-Query selection
Dynamic KV cut-off mechanism
Training-free ActQKV approach
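The dynamic KV cut-off idea listed above adapts how many KV pairs each layer retains based on the information density of that layer's retrieval scores. A minimal sketch, assuming an entropy-based density heuristic (the paper's exact criterion may differ):

```python
import math

def layer_budget(scores, base_budget, min_keep=1):
    """Adapt the per-layer KV retention budget from the information density
    of its retrieval scores. Density is the normalized entropy of the
    softmax over scores: near 1.0 for flat (uninformative) score
    distributions, near 0.0 for sharply peaked ones, so peaked layers
    get cut off more aggressively. Heuristic, assumed for illustration.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]          # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    density = entropy / math.log(len(scores))          # normalize to [0, 1]
    return max(min_keep, round(base_budget * density))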
Qingfa Xiao
PhD student at The Hong Kong University of Science and Technology (Guangzhou)
Natural Language Processing · Contrastive Learning · Large Language Model
Jiachuan Wang
The Hong Kong University of Science and Technology
Haoyang Li
The Hong Kong Polytechnic University
Cheng Deng
University of Edinburgh
On-device LLM · NLP · GeoAI
Jiaqi Tang
The Hong Kong University of Science and Technology (Guangzhou)
Shuangyin Li
South China Normal University
Yongqi Zhang
Assistant Professor at HKUST(GZ)
Graph learning · Drug discovery · Deep learning
Jun Wang
University College London
Lei Chen
The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology (Guangzhou)