🤖 AI Summary
This work identifies a previously unrecognized phenomenon in Transformer large language models: highly concentrated large-magnitude values in the query (Q) and key (K) representations of the early attention layers, directly induced by Rotary Position Embedding (RoPE) and functionally dedicated to contextual knowledge understanding rather than parametric knowledge retrieval.
Method: Through cross-model attention visualization (LLaMA, Qwen, etc.), quantitative ablation studies, RoPE gradient attribution analysis, and inter-layer value tracking, we systematically characterize this concentration pattern.
Contribution/Results: We establish, for the first time, that such large-value concentration constitutes a critical mechanism for contextual modeling, challenging the conventional assumption of uniformly distributed attention values. Ablation experiments confirm that masking these massive values significantly degrades performance on context-intensive tasks (e.g., long-context QA, few-shot reasoning). Open-sourced code enables reproduction of these findings across architectures and scales.
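The masking ablation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the quantile threshold, toy tensor shapes, and the planted massive-value region are all assumptions made for demonstration.

```python
import numpy as np

def mask_massive(x, quantile=0.99):
    """Zero out the largest-magnitude entries of x.

    A hypothetical stand-in for the masking ablation: the threshold
    (top 1% by absolute value) is an illustrative assumption.
    """
    thresh = np.quantile(np.abs(x), quantile)
    return np.where(np.abs(x) >= thresh, 0.0, x)

# Toy attention scores before and after masking massive values in Q.
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 64))
K = rng.normal(size=(8, 64))
Q[0, :4] += 20.0  # plant a concentrated massive-value region for illustration

scores = Q @ K.T / np.sqrt(Q.shape[-1])
scores_masked = mask_massive(Q) @ K.T / np.sqrt(Q.shape[-1])
print(np.abs(scores - scores_masked).max() > 0)  # masking changes the scores
```

In the paper's actual experiments the masking is applied inside real LLMs and evaluated on downstream tasks; this toy version only shows that zeroing a few concentrated entries measurably perturbs the attention scores.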
📝 Abstract
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such pattern appears in values (V), across various modern transformer-based LLMs (Q, K, and V denote the representations output by the query, key, and value layers, respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our investigation of quantization strategies reveals that ignoring these massive values causes a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of the concentrated massive values and find that the concentration is caused by Rotary Positional Encoding (RoPE) and appears as early as the first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
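For readers unfamiliar with the mechanism named in the abstract, a minimal sketch of the rotary operation (a standard textbook formulation, not the repository's code) shows why RoPE acts on Q and K but leaves V untouched: only Q and K pass through the position-dependent rotation before the attention dot product.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply Rotary Position Embedding to a single head vector x at position pos.

    x: shape (d,) with d even. Each pair (x[2i], x[2i+1]) is rotated by the
    angle pos * base**(-2i/d), so low-index dimensions rotate fastest.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# RoPE is a per-pair rotation, so it preserves the norm of each 2D pair;
# any concentration of massive values in Q/K therefore reflects what the
# trained projections feed into particular frequency bands, not the
# rotation amplifying anything by itself.
q = np.random.default_rng(0).normal(size=64)
q_rot = rope(q, pos=5)
print(np.allclose(np.linalg.norm(q), np.linalg.norm(q_rot)))  # norm preserved
```

Production implementations apply this rotation batched over all positions and heads, but the per-vector form above captures the operation the paper traces the concentration back to.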