Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high computational cost of the prefill phase in large language and multimodal models under long contexts, where existing token pruning methods rely on heuristic rules that are hard to combine with efficient attention kernels. The authors propose Delta Attention Selective Halting (DASH), which, reportedly for the first time, links token semantic stability to computational redundancy: it monitors inter-layer differences in self-attention updates to identify tokens whose representations have converged, then halts further computation for them. DASH requires no additional training, is compatible with efficient attention kernels such as FlashAttention, and is reported to significantly accelerate prefilling on both language and vision benchmarks while preserving model accuracy.

📝 Abstract
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixed points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
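
To make the halting idea concrete, here is a minimal NumPy sketch of one way a "halt stabilized tokens" policy could look. It is not the authors' implementation: the abstract does not state the stability criterion, so the relative update norm, the threshold `tau`, and the helper names `relative_update` and `run_layers` are all assumptions made for illustration. Each toy "layer" is a plain function returning the update that would be added to the residual stream.

```python
import numpy as np


def relative_update(hidden, update, eps=1e-8):
    """Per-token size of a layer's update relative to the token's current state.

    ASSUMPTION: this ratio stands in for the 'layer-wise update dynamics'
    named in the abstract; the paper's actual criterion may differ.
    """
    return np.linalg.norm(update, axis=-1) / (np.linalg.norm(hidden, axis=-1) + eps)


def run_layers(hidden, layers, tau=0.05):
    """Apply layers in order, dropping tokens whose update falls below tau.

    hidden: (seq_len, dim) array of token states.
    layers: list of callables mapping (n_active, dim) -> (n_active, dim) updates.
    tau:    hypothetical stability threshold (not taken from the paper).
    """
    hidden = hidden.copy()
    active = np.ones(hidden.shape[0], dtype=bool)
    for layer in layers:
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break  # every token has stabilized; skip the remaining layers
        dense = hidden[idx]          # gather active tokens into a shorter dense block
        update = layer(dense)        # the layer only sees the shortened sequence
        hidden[idx] = dense + update
        # Tokens with a small relative update are treated as converged and halted.
        active[idx] = relative_update(dense, update) >= tau
    return hidden, active


# Toy usage: the second layer barely changes any token, so all tokens halt
# and the third layer is never applied.
toy_layers = [lambda h: 0.5 * h, lambda h: 0.01 * h, lambda h: 1.0 * h]
final_hidden, still_active = run_layers(np.ones((3, 2)), toy_layers)
```

Because the active tokens are gathered into a shorter contiguous block before each layer, the layer operates on an ordinary dense input, which is one plausible reason such a policy can coexist with fused attention kernels. The sketch ignores details a real model would need, such as whether halted tokens still serve as keys and values for the tokens that remain active.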
Problem

Research questions and friction points this paper is trying to address.

prefilling
long-context
computational cost
Large Language Models
token redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Delta Attention
Selective Halting
Long-Context Prefilling
Semantic Fixed Points
Training-Free Pruning