Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-sequence Transformer inference, sparse attention reduces computational cost but shifts the distribution of the attention outputs, so decoding-time queries no longer align well with the keys produced during prefill and accuracy drops sharply. This paper proposes Delta Correction, a simple mechanism that brings the output distribution of sparse attention back toward that of dense (quadratic) attention. The method is plug-and-play and can be applied on top of any sparse attention scheme. Applied to sliding-window attention with sink tokens, it recovers 88% of full-attention accuracy on the 131K RULER benchmark (an average gain of 36 percentage points) while maintaining roughly 98.5% sparsity, and it is 32x faster than FlashAttention-2 on 1M-token prefills. The core contribution is a distribution-aware correction that relaxes the accuracy-efficiency trade-off of sparse attention inference.
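
As a rough illustration of the idea described above, the sketch below computes a cheap sparse pass (sliding window plus sink tokens) for every query, recomputes exact causal attention for a small strided subset of "anchor" queries, and adds the resulting difference (the delta) back onto the sparse outputs of nearby queries. This is a conceptual sketch only, not the paper's implementation: the function names, the strided anchor selection, and the nearest-anchor broadcast of the delta are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=256, sink=4):
    # Sparse pass: each query attends only to the first `sink` tokens
    # and to the `window` most recent tokens at or before its position.
    n, d = q.shape
    idx = torch.arange(n)
    dist = idx[:, None] - idx[None, :]  # query position minus key position
    mask = (dist >= 0) & ((dist < window) | (idx[None, :] < sink))
    scores = (q @ k.transpose(-2, -1) / d ** 0.5).masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def delta_corrected_attention(q, k, v, window=256, sink=4, stride=64):
    # 1) Cheap sparse output for every query position.
    sparse_out = sliding_window_attention(q, k, v, window, sink)
    # 2) Exact causal attention for a strided subset of "anchor" queries.
    n, d = q.shape
    anchors = torch.arange(0, n, stride)
    causal = torch.arange(n)[None, :] <= anchors[:, None]
    scores = (q[anchors] @ k.transpose(-2, -1) / d ** 0.5).masked_fill(~causal, float("-inf"))
    dense_anchor_out = F.softmax(scores, dim=-1) @ v
    # 3) Delta between dense and sparse outputs at the anchor positions.
    delta = dense_anchor_out - sparse_out[anchors]
    # 4) Shift every query's sparse output by the delta of its nearest
    #    preceding anchor, nudging it toward the dense output distribution.
    return sparse_out + delta[torch.arange(n) // stride]

# Toy usage on random data (single head, no batch, for readability).
torch.manual_seed(0)
q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = delta_corrected_attention(q, k, v)
print(out.shape)  # torch.Size([1024, 64])
```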

📝 Abstract
The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36%pt performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
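
The sparsity figure in the abstract comes down to simple counting: a sliding window of w keys plus s sink tokens costs roughly n(w + s) score entries instead of the ~n^2/2 of causal quadratic attention, and a strided exact-attention pass (as in the sketch above) would add about n^2/(2*stride) more. The snippet below does this back-of-the-envelope accounting; the window, sink, and stride values are illustrative assumptions, not the paper's reported configuration.

```python
# Back-of-the-envelope sparsity accounting. Window, sink, and stride
# values are illustrative assumptions, not the paper's configuration.

def score_entries(n, window=2048, sink=128, stride=None):
    """Approximate number of (query, key) score entries that get computed."""
    full = n * (n + 1) // 2                                    # causal quadratic attention
    sparse = sum(min(i + 1, window + sink) for i in range(n))  # window + sink pass
    if stride is not None:
        # Optional strided exact-attention pass, as in the correction sketch above.
        sparse += sum(i + 1 for i in range(0, n, stride))
    return sparse, full

for n in (131_072, 1_048_576):
    sparse, full = score_entries(n, stride=128)
    print(f"n={n:>9,}  approx. sparsity = {1 - sparse / full:.2%}")
```
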
Problem

Research questions and friction points this paper is trying to address.

Reducing the quadratic cost of transformer attention for long-sequence inference
Mitigating the accuracy degradation that accompanies sparse attention inference
Correcting the distributional shift that sparse computation induces in attention outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Delta correction that compensates for the distribution shift introduced by sparse attention
Plug-and-play: compatible with any sparse attention method (see the wrapper sketch after this list)
Maintains roughly 98.5% sparsity while adding only a small overhead
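
The plug-and-play claim suggests the correction can wrap an arbitrary sparse attention backend. Below is a minimal sketch of what such a wrapper interface could look like, reusing the strided-anchor correction from the earlier sketch; the wrapper name, the callable signature, and the toy local-attention stand-in are all assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F
from typing import Callable

AttnFn = Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]

def with_delta_correction(sparse_attn: AttnFn, stride: int = 64) -> AttnFn:
    """Wrap an arbitrary sparse attention callable with a strided
    dense-anchor delta correction (illustrative interface only)."""
    def corrected(q, k, v):
        sparse_out = sparse_attn(q, k, v)
        n, d = q.shape
        anchors = torch.arange(0, n, stride)
        causal = torch.arange(n)[None, :] <= anchors[:, None]
        scores = (q[anchors] @ k.transpose(-2, -1) / d ** 0.5).masked_fill(~causal, float("-inf"))
        dense_anchor_out = F.softmax(scores, dim=-1) @ v
        delta = dense_anchor_out - sparse_out[anchors]
        return sparse_out + delta[torch.arange(n) // stride]
    return corrected

# A trivial local-window sparse attention stand-in to demonstrate wrapping.
def local_attention(q, k, v, window=128):
    n, d = q.shape
    dist = torch.arange(n)[:, None] - torch.arange(n)[None, :]
    mask = (dist >= 0) & (dist < window)
    scores = (q @ k.transpose(-2, -1) / d ** 0.5).masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

corrected_attention = with_delta_correction(local_attention)
q, k, v = (torch.randn(512, 64) for _ in range(3))
print(corrected_attention(q, k, v).shape)  # torch.Size([512, 64])
```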