🤖 AI Summary
This work addresses the poorly understood functional roles of attention heads in long-context modeling. We propose a lightweight, locality-based method that dynamically identifies attention heads requiring long-range information using only local keys. We empirically discover and theoretically establish a bimodal (local vs. long-range) behavior of attention heads in long-context settings; crucially, we prove that long-range attention scores can be accurately approximated via second-order moment statistics—revealing an intrinsic statistical simplicity. Building on this insight, we design a head selection mechanism comprising three components: local-key-based prediction, second-order statistical approximation, and head-level adaptive filtering. Experiments across Llama and Qwen models demonstrate that our approach achieves high-precision identification of long-range heads with negligible overhead, reducing inference FLOPs by up to 42% for long sequences. This establishes a novel paradigm for efficient long-context reasoning in large language models.
📝 Abstract
The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in improving the efficiency of attention mechanisms, how attention heads function in long-context settings is still poorly understood. In this paper, we observe that while certain heads consistently attend only to local information, others alternate between local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it is possible to predict which heads are crucial for long-context processing using only local keys. The core idea is to model the long-context attention scores with a simple second-moment approximation. These findings reveal simple statistical properties of attention over long sequences and open the door to potentially significant efficiency gains.
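To make the core idea concrete, here is a minimal sketch of how a second-moment model of long-range scores could drive head selection. This is an illustrative assumption, not the paper's exact estimator: it treats each head's long-range logits `q·k` as approximately Gaussian with mean and covariance taken from the distant keys, and uses the moment identity `E[exp(s)] ≈ exp(μ + σ²/2)` to estimate the unnormalized long-range attention mass without touching the distant keys at decode time. The function name, threshold `tau`, and all statistics are hypothetical.

```python
import numpy as np

def needs_long_range(q, local_keys, key_mean, key_cov, n_long, tau=0.9):
    """Decide whether a head needs long-range keys for this query.

    Hypothetical sketch (not the paper's exact method): exact scores are
    computed only on the local keys; the total long-range attention mass
    is approximated from second-order statistics of the distant keys via
    Gaussian moment matching, E[exp(q.k)] ~= exp(mu + sigma^2 / 2).

    q          : (d,)   query vector
    local_keys : (m, d) local (recent) keys, attended exactly
    key_mean   : (d,)   mean of the distant keys (precomputed)
    key_cov    : (d, d) covariance of the distant keys (precomputed)
    n_long     : number of distant keys
    tau        : if less than this fraction of mass is local,
                 flag the head as long-range
    """
    local_logits = local_keys @ q                    # exact local scores
    mu = q @ key_mean                                # mean long-range logit
    sigma2 = q @ key_cov @ q                         # long-range logit variance
    long_mass = n_long * np.exp(mu + 0.5 * sigma2)   # approx sum of exp(logits)
    local_mass = np.exp(local_logits).sum()
    frac_local = local_mass / (local_mass + long_mass)
    return bool(frac_local < tau)  # True -> route this head to the full context
```

In this toy setting, a head whose local logits are large relative to the estimated long-range mass is filtered out and served from the local window only, while a head with flat local scores is flagged as long-range; per query this costs only a few dot products per head.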