AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing heuristic- or attention-score–based static ranking methods for KV cache compression in long-context LLM inference overlook the spatiotemporal dynamics of attention scores, leading to inaccurate critical token identification and performance degradation. This work proposes the first learning-based framework for critical token identification: (1) a lightweight convolutional network models the temporal evolution of attention scores; (2) a cross-token KV cache prefetching mechanism hides the prediction overhead; and (3) together these enable dynamic, fine-grained KV cache compression. Evaluated on standard long-context benchmarks, the method achieves a 16× KV cache compression ratio while retaining over 98% of the original model’s accuracy, significantly outperforming current state-of-the-art approaches. The framework bridges the gap between attention dynamics modeling and efficient memory management, enabling scalable long-context inference without compromising model fidelity.

📝 Abstract
With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the temporal patterns in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16× KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.
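To make the core idea concrete, here is a minimal sketch of learning-based critical token identification: a per-token temporal window of past attention scores is convolved with a learned kernel to predict the next step's scores, and only the top fraction of KV entries is kept. All names, shapes, and the single-kernel simplification are illustrative assumptions, not the paper's actual model or API.

```python
import numpy as np

def predict_next_attention(history, kernel):
    """Predict next-step attention scores per cached token.

    history: (T, N) array of attention scores over the last T decode
             steps for N cached tokens.
    kernel:  (T,) learned temporal weights (hypothetical simplification
             of the paper's lightweight convolution model).
    Returns a (N,) array: a weighted sum over the temporal axis, i.e.
    a 1-D convolution evaluated at the most recent position.
    """
    return kernel @ history

def select_critical_tokens(pred_scores, compression_ratio=16):
    """Keep only the top 1/compression_ratio fraction of KV entries."""
    n_keep = max(1, len(pred_scores) // compression_ratio)
    return np.argsort(pred_scores)[-n_keep:]

# Toy usage: 4 past steps, 32 cached tokens, 16x compression.
rng = np.random.default_rng(0)
history = rng.random((4, 32))
kernel = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical learned weights
scores = predict_next_attention(history, kernel)
keep = select_critical_tokens(scores, compression_ratio=16)
print(len(keep))  # 2 tokens kept out of 32
```

The key contrast with heuristic ranking is that the kernel is trained on observed attention trajectories, so tokens whose scores are rising can be retained even if their current score is still low.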
Problem

Research questions and friction points this paper is trying to address.

Efficient KV cache compression
Temporal patterns in attention scores
Critical token identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning-based critical token identification
Lightweight convolutional attention-score prediction model
Cross-token cache prefetching framework
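The cross-token prefetching contribution can be sketched as follows: while the current token is being decoded, a background thread gathers the KV entries predicted to be critical for the next token, hiding the prediction and transfer latency. The class and method names below are illustrative assumptions, not the paper's implementation.

```python
import threading

class CrossTokenPrefetcher:
    """Hedged sketch of cross-token KV cache prefetching.

    While the main loop computes decode step t, a worker thread copies
    the KV entries predicted to be critical for step t+1 from a slow
    store (e.g. host memory) so they are ready when needed.
    """

    def __init__(self, full_kv_store):
        self.store = full_kv_store   # full KV cache, e.g. offloaded to CPU
        self.prefetched = None
        self._thread = None

    def prefetch(self, predicted_indices):
        """Launch an async gather of the predicted-critical KV entries."""
        def _work():
            self.prefetched = [self.store[i] for i in predicted_indices]
        self._thread = threading.Thread(target=_work)
        self._thread.start()         # overlaps with the current decode step

    def wait(self):
        """Block until the gather finishes, then return the entries."""
        self._thread.join()
        return self.prefetched

# Toy usage: 32 cached entries, prefetch two predicted-critical ones.
store = {i: f"kv_{i}" for i in range(32)}
pf = CrossTokenPrefetcher(store)
pf.prefetch([3, 17])   # launched while "decoding" the current token
print(pf.wait())       # ['kv_3', 'kv_17']
```

Because the gather overlaps with useful decode work, the prediction overhead is effectively hidden rather than added to the critical path.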
Qingyue Yang
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Jie Wang
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Xing Li
Noah’s Ark Lab, Huawei Technologies
Zhihai Wang
Qwen Team, PhD, USTC
Sample-Efficient Reinforcement Learning, RL4LLM, Agentic RL
Chen Chen
Noah’s Ark Lab, Huawei Technologies
Lei Chen
Noah’s Ark Lab, Huawei Technologies
Xianzhi Yu
Unknown affiliation
AI, HPC
Wulong Liu
Unknown affiliation
Reinforcement Learning, Autonomous Driving, Robotics, AI Infra, EDA
Jianye Hao
Huawei Noah's Ark Lab/Tianjin University
Multiagent Systems, Embodied AI
Mingxuan Yuan
Noah’s Ark Lab, Huawei Technologies
Bin Li
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China