🤖 AI Summary
The quadratic time complexity $O(n^2)$ of dot-product attention in Transformers severely hinders inference efficiency for long-context applications. To address this, we propose a training-free, model-agnostic framework that dynamically selects salient tokens without architectural modification or fine-tuning—enabling zero-shot acceleration of any pretrained Transformer. Our method quantifies token importance probabilistically via attention scores and employs a theoretically grounded sparse sampling strategy to compress context while preserving critical tokens with high probability. This reduces decoding complexity to approximately $O(nk)$, where $k \ll n$. Extensive experiments across diverse tasks and architectures demonstrate that our approach maintains state-of-the-art performance while significantly accelerating inference—eliminating the need for heuristic truncation or costly retraining.
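To make the complexity reduction concrete, here is a minimal NumPy sketch of attending a query to only its $k$ highest-scoring keys instead of all $n$. This is an illustration of the general top-$k$ idea only, not Radar's actual selection algorithm; the function name and the selection rule (an `argpartition` over raw dot products) are assumptions for demonstration.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend one query vector to only its k highest-scoring keys.

    Full dot-product attention scores all n keys per decoding step,
    giving O(n^2) total decoding cost; restricting each step to k
    selected keys brings this down to roughly O(nk). This is a toy
    sketch of that idea, not Radar's token-search procedure.
    """
    scores = K @ q / np.sqrt(q.shape[0])         # (n,) scaled dot-product scores
    idx = np.argpartition(scores, -k)[-k:]       # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())  # softmax over the selected keys
    w /= w.sum()
    return w @ V[idx]                            # (d,) attended output
```

With `k = n` the result coincides with full softmax attention, since the softmax is taken over every key; shrinking `k` trades a small approximation error for per-step cost linear in `k`.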
📝 Abstract
Transformer models have demonstrated exceptional performance across a wide range of applications. Though it forms the foundation of Transformer models, dot-product attention does not scale well to long-context data, since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.