🤖 AI Summary
The quadratic time complexity $O(n^2)$ of dot-product attention in Transformers severely hinders inference efficiency for long-context applications. To address this, we propose a training-free, model-agnostic framework that dynamically selects salient tokens without architectural modification or fine-tuning—enabling zero-shot acceleration of any pretrained Transformer. Our method quantifies token importance probabilistically via attention scores and employs a theoretically grounded sparse sampling strategy to compress context while preserving critical tokens with high probability. This reduces decoding complexity to approximately $O(nk)$, where $k \ll n$. Extensive experiments across diverse tasks and architectures demonstrate that our approach maintains state-of-the-art performance while significantly accelerating inference—eliminating the need for heuristic truncation or costly retraining.
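To make the complexity reduction concrete, here is a minimal NumPy sketch of attending a query to only its $k$ highest-scoring keys instead of all $n$. This is an illustration of the general top-$k$ idea only, not Radar's actual selection algorithm; the function name and the selection rule (an `argpartition` over raw dot products) are assumptions for demonstration.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend one query vector to only its k highest-scoring keys.

    Full dot-product attention scores all n keys per decoding step,
    giving O(n^2) total decoding cost; restricting each step to k
    selected keys brings this down to roughly O(nk). This is a toy
    sketch of that idea, not Radar's token-search procedure.
    """
    scores = K @ q / np.sqrt(q.shape[0])         # (n,) scaled dot-product scores
    idx = np.argpartition(scores, -k)[-k:]       # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())  # softmax over the selected keys
    w /= w.sum()
    return w @ V[idx]                            # (d,) attended output
```

With `k = n` the result coincides with full softmax attention, since the softmax is taken over every key; shrinking `k` trades a small approximation error for per-step cost linear in `k`.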
📝 Abstract
Transformer models have demonstrated exceptional performance across a wide range of applications. Though it forms the foundation of Transformer models, dot-product attention does not scale well to long-context data, since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.