🤖 AI Summary
Attention over a long context is a major computational cost in long-context LLM inference, and existing KV cache compression methods degrade sharply during long-text generation because static, one-time compression discards critical tokens prematurely. To address both problems, this paper proposes a dynamic attention switching mechanism that alternates, during decoding, between full-context attention and sparse attention that reuses historical attention patterns. Key contributions include: (i) an attention pattern reuse mechanism, enabling query-similarity-driven adaptive switching between full and reused attention; (ii) dynamic K-token subset selection with threshold-based control; and (iii) exploration of reuse-guided continual pretraining. Evaluated on long-context benchmarks including RULER, the method achieves a 2× improvement in task performance over local attention baselines at a comparable speedup.
📝 Abstract
Generating long sequences of tokens given a long-context input imposes a heavy computational burden for large language models (LLMs). One of the computational bottlenecks comes from computing attention over a long input sequence at each generation step. In this paper, we propose Recycled Attention, an inference-time method that alternates between full-context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens, reducing the cost of data movement and attention computation. Compared to previously proposed inference-time acceleration methods, which attend only to local context or to tokens with high cumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step. We evaluate our method on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and on long-context language modeling tasks. Applying our method to off-the-shelf LLMs achieves speedup comparable to baselines that only consider local context while improving performance by 2x. We further explore two ideas to improve the performance-efficiency trade-off: (1) dynamically deciding when to perform a recycled or full attention step based on query similarity and (2) continued pre-training of the model with Recycled Attention.
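The alternation between full and recycled attention steps can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the fixed `stride` schedule, the function names `attend` and `decode_with_recycling`, and the simplification that the key/value cache stays fixed (newly generated tokens are not appended) are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, indices):
    """Scaled dot-product attention restricted to the cache positions in `indices`."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]
    return output, weights


def decode_with_recycling(queries, keys, values, top_k, stride):
    """Return a list of (output, attended_indices), one entry per decoding step.

    Assumed schedule: every `stride`-th step attends to the full cache and
    records the top_k most attended positions; the steps in between attend
    only to that recycled subset.
    """
    recycled = []
    results = []
    for step, query in enumerate(queries):
        if step % stride == 0:
            # Full attention step: score every cached token, then keep the top K.
            indices = list(range(len(keys)))
            output, weights = attend(query, keys, values, indices)
            ranked = sorted(indices, key=lambda i: weights[i], reverse=True)
            recycled = sorted(ranked[:top_k])
        else:
            # Recycled step: reuse the subset chosen at the last full step.
            indices = recycled
            output, _ = attend(query, keys, values, indices)
        results.append((output, indices))
    return results


# Toy usage with random vectors (illustrative data only).
random.seed(0)
n_tokens, dim = 16, 4
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
steps = decode_with_recycling(queries, keys, values, top_k=4, stride=3)
print([len(indices) for _, indices in steps])
```

In this sketch the recycled steps touch only `top_k` cache entries, which is where the savings in data movement and attention computation would come from. Replacing the fixed `stride` with a test on the similarity between the current query and the query of the last full step would correspond to the dynamic variant mentioned above, though the exact criterion is not specified here.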