Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks a unified explanation for the emergence of diverse attention patterns—such as retrieval heads, sink heads, and diagonal traces—in large language models. This work proposes TAPPA, a novel framework that establishes the first unified theory from a temporally continuous perspective, categorizing attention patterns into predictable and unpredictable types. It identifies temporal self-similarity in queries as the key determinant of predictability. By jointly analyzing queries, keys, and Rotary Position Embeddings (RoPE), TAPPA formulates a quantitative metric for attention predictability. Experimental results demonstrate that a simple indicator derived from this metric consistently outperforms existing baselines in KV cache compression and LLM pruning tasks, confirming its effectiveness and practical utility.

📝 Abstract
Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.
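The abstract identifies query self-similarity along the temporal dimension as the property separating predictable from unpredictable attention patterns. The sketch below illustrates one plausible way to quantify this idea; the function name, the use of adjacent-position cosine similarity, and the synthetic inputs are assumptions for illustration, not the paper's exact metric.

```python
import numpy as np

def query_self_similarity(queries: np.ndarray) -> float:
    """Mean cosine similarity between query vectors at consecutive positions.

    queries: array of shape (seq_len, head_dim) for one attention head.
    High values indicate temporally smooth (hence more predictable) queries;
    values near zero indicate effectively random queries.
    """
    normed = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    # Cosine similarity of each position with the next one.
    sims = np.sum(normed[:-1] * normed[1:], axis=-1)
    return float(sims.mean())

# Synthetic demo: a slowly drifting query stream vs. i.i.d. Gaussian noise.
rng = np.random.default_rng(0)
smooth = np.cumsum(rng.normal(scale=0.01, size=(128, 64)), axis=0) + 1.0
noisy = rng.normal(size=(128, 64))
print(query_self_similarity(smooth))  # near 1: temporally self-similar
print(query_self_similarity(noisy))   # near 0: no temporal structure
```

A head whose queries score high under such a measure would, by the framework's argument, produce attention maps with clear regularities that are cheap to predict, which is what makes it a candidate for KV cache compression or pruning.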
Problem

Research questions and friction points this paper is trying to address.

attention patterns
large language models
temporal analysis
query self-similarity
unifying framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Attention Pattern Predictability Analysis
attention patterns
query self-similarity
RoPE
KV cache compression
Qingyue Yang
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Jie Wang
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Xing Li
Huawei Noah's Ark Lab
LLM Inference · Test Time Scaling · Agentic AI · Logic Synthesis
Yinqi Bai
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Xialiang Tong
Huawei Technologies Co., Ltd.
Huiling Zhen
Huawei Technologies Co., Ltd.
Jianye Hao
Huawei Noah's Ark Lab/Tianjin University
Multiagent Systems · Embodied AI
Mingxuan Yuan
Huawei Technologies Co., Ltd.
Bin Li
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China