🤖 AI Summary
This work addresses the high computational cost of the prefill phase in large language and multimodal models under long-context scenarios. Existing token pruning methods rely on heuristic rules and are difficult to combine with efficient attention kernels. The authors propose Delta Attention Selective Halting (DASH), which they present as the first method to link a token's semantic stability to computational redundancy: it monitors inter-layer differences in self-attention updates, identifies tokens whose representations have semantically converged, and halts further computation for them. DASH requires no additional training, is compatible with efficient attention kernels such as FlashAttention, and is reported to accelerate the prefill phase significantly on both language and vision benchmarks while preserving model accuracy and hardware efficiency.
📝 Abstract
The computational cost of prefilling is a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward "semantic fixing points", after which further processing is redundant. Building on this observation, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
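The abstract does not spell out the halting criterion, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a token counts as stabilized once the norm of its attention update, relative to the norm of its current hidden state, falls below a threshold. The names `relative_update`, `prefill_with_halting`, and `tau`, as well as the layer interface, are hypothetical and chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# Hypothetical layer interface: given all hidden states and the indices of
# still-active tokens, return the attention update (delta) for each active token.
Layer = Callable[[List[Vector], List[int]], Dict[int, Vector]]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def relative_update(hidden: Sequence[float], delta: Sequence[float],
                    eps: float = 1e-8) -> float:
    """Size of a token's update relative to its current hidden state."""
    return l2_norm(delta) / (l2_norm(hidden) + eps)


def prefill_with_halting(hidden: List[Vector], layers: Sequence[Layer],
                         tau: float) -> Tuple[List[Vector], Dict[int, int]]:
    """Run the layers, freezing tokens whose relative update drops below tau.

    Assumption: the stability test and the threshold `tau` are illustrative;
    the actual DASH criterion may differ.
    Returns the final hidden states and a map token index -> halting depth.
    """
    hidden = [list(h) for h in hidden]      # work on a copy
    active = list(range(len(hidden)))       # tokens still being processed
    halted_at: Dict[int, int] = {}

    for depth, layer in enumerate(layers):
        if not active:
            break                           # every token has been halted
        deltas = layer(hidden, active)      # compute updates for active tokens only
        still_active = []
        for i in active:
            if relative_update(hidden[i], deltas[i]) >= tau:
                still_active.append(i)
            else:
                halted_at[i] = depth        # token stabilized at this depth
            # Apply this layer's update, then freeze the token if it was halted.
            hidden[i] = [h + d for h, d in zip(hidden[i], deltas[i])]
        active = still_active

    return hidden, halted_at
```

In this sketch the layers receive a contiguous list of active indices, so the remaining tokens could in principle be gathered into a shorter dense sequence before calling an unmodified attention kernel. Whether halted tokens continue to serve as keys and values for the active ones is a design choice that the abstract leaves open.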