Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

📅 2026-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently processing long-context sequences with large language models, whose full attention mechanisms suffer from quadratic computational complexity. Existing sparse attention methods struggle to balance efficiency, accuracy, and training cost, yet this study shows that full-attention models are already highly sparse internally. Building on this insight, the authors propose RTPurbo, which avoids costly sparse pretraining by identifying the few attention heads that need full long-context processing (retrieval heads), retaining the full KV cache only for those heads, and adding a lightweight 16-dimensional token indexer that retrieves relevant tokens in a low-dimensional subspace using dynamic top-p selection. With only a few hundred fine-tuning steps, the method achieves up to a 9.36× prefill speedup at million-token context lengths and about a 2.01× decode speedup, while preserving near-lossless inference accuracy.
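The idea of scoring tokens in a 16-dimensional subspace can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_projection`, `project`, `index_scores`) are hypothetical, and the projection here is a fixed random matrix, whereas the summary indicates the real indexer is obtained through a few hundred fine-tuning steps. The sketch only shows why scoring becomes cheaper: each key is compared to the query using 16 numbers instead of the full head dimension.

```python
import random

INDEX_DIM = 16  # indexer width named in the summary


def make_projection(head_dim, index_dim=INDEX_DIM, seed=0):
    """Hypothetical stand-in for a learned projection: a random
    head_dim x index_dim matrix (the real indexer would be trained)."""
    rng = random.Random(seed)
    scale = head_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(index_dim)]
            for _ in range(head_dim)]


def project(vec, proj):
    """Map a head_dim vector to the low-dimensional index space."""
    index_dim = len(proj[0])
    return [sum(vec[i] * proj[i][j] for i in range(len(vec)))
            for j in range(index_dim)]


def index_scores(query, keys, proj):
    """Score every key against the query using only index_dim numbers each."""
    q_low = project(query, proj)
    return [sum(a * b for a, b in zip(q_low, project(k, proj)))
            for k in keys]
```

In a real system the projected keys would be computed once and cached, so that each new query only pays for 16-dimensional dot products; the sketch recomputes them for brevity.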
📝 Abstract
Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
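Observation (3) contrasts dynamic top-$p$ selection with fixed top-$k$ sparsification. The following is a minimal sketch of that difference, assuming the scores are softmax-normalized and tokens are kept in descending order of probability until their cumulative mass reaches $p$. The function names and the threshold value `p=0.9` are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p=0.9):
    """Keep the smallest set of tokens whose probability mass reaches p.
    The number of kept tokens depends on how peaked the scores are."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def top_k_select(scores, k):
    """Keep exactly k tokens regardless of the score distribution."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

With a sharply peaked score list such as `[10.0, 0.0, 0.0, 0.0]`, `top_p_select` keeps a single token, while a flat list such as `[0.0, 0.0, 0.0, 0.0]` makes it keep all four. `top_k_select` returns `k` tokens in both cases, which is the mismatch the query-dependent budget is meant to avoid.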
Problem

Research questions and friction points this paper is trying to address.

long-context inference
full attention
quadratic cost
sparse attention
efficiency-accuracy trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention
long-context LLMs
dynamic sparsification
low-dimensional retrieval
RTPurbo