🤖 AI Summary
This work addresses two weaknesses of Softmax attention: its tendency to place probability mass on irrelevant tokens (so-called "attention sinks") and the dispersion of attention probabilities over long sequences, both of which degrade model performance. To mitigate these issues, the authors propose Threshold Differential Attention (TDA), a mechanism that combines row-wise extreme-value thresholding, a length-dependent gating function, and an inhibitory attention view. TDA achieves ultra-sparse, sink-free, and non-diffuse attention distributions without incurring additional projection costs. It produces over 99% exact-zero attention weights, bounds the expected number of spurious survivors per row by a constant, and ensures that cross-view false matches vanish asymptotically as context length increases. Empirical results demonstrate that TDA maintains competitive performance on both standard and long-context benchmarks.
📝 Abstract
Softmax attention struggles with long contexts due to structural limitations: the strict sum-to-one constraint forces attention sinks on irrelevant tokens, and probability mass disperses as sequence length increases. We tackle these problems with Threshold Differential Attention (TDA), a sink-free attention mechanism that achieves ultra-sparsity and improved robustness at longer sequence lengths without the computational overhead of projection methods or the performance degradation caused by noise accumulation in standard rectified attention. TDA applies row-wise extreme-value thresholding with a length-dependent gate, retaining only exceedances. Inspired by the differential transformer, TDA also subtracts an inhibitory view to enhance expressivity. Theoretically, we prove that TDA controls the expected number of spurious survivors per row to $O(1)$ and that consensus spurious matches across independent views vanish as context grows. Empirically, TDA produces $>99\%$ exact zeros and eliminates attention sinks while maintaining competitive performance on standard and long-context benchmarks.
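The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the differential weight `lam`, the gate coefficient `c`, and the specific form of the length-dependent threshold (here, a margin that grows with $\log m$ over the row statistics) are all assumptions standing in for the paper's extreme-value gate. The sketch shows the two key structural points: the inhibitory view is subtracted before thresholding, and non-exceedances become exact zeros rather than small positive weights.

```python
import numpy as np

def tda_attention(Q1, K1, Q2, K2, V, lam=0.5, c=1.0):
    """Illustrative sketch of Threshold Differential Attention (TDA).

    A primary and an inhibitory score view are computed; the inhibitory
    view is subtracted (differential scores), then a length-dependent
    threshold keeps only exceedances. Entries below the threshold get
    exactly zero weight. `lam`, `c`, and the log-length gate are
    hypothetical choices, not the paper's exact formulation.
    """
    n, d = Q1.shape
    m = K1.shape[0]
    s1 = Q1 @ K1.T / np.sqrt(d)   # primary view scores
    s2 = Q2 @ K2.T / np.sqrt(d)   # inhibitory view scores
    diff = s1 - lam * s2          # differential scores

    # Length-dependent gate (assumed form): the margin above the row
    # mean grows with log(m), so spurious survivors stay O(1) per row.
    tau = diff.mean(axis=1, keepdims=True) \
        + c * np.log(m) * diff.std(axis=1, keepdims=True)
    mask = diff > tau

    # Guarantee at least one survivor per row (keep the row maximum).
    mask[np.arange(n), diff.argmax(axis=1)] = True

    # Softmax over survivors only; exp(-inf) yields exact zeros.
    gated = np.where(mask, diff, -np.inf)
    w = np.exp(gated - gated.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V, w

# Toy usage: 8 queries attending over 512 keys.
rng = np.random.default_rng(0)
n, m, d = 8, 512, 16
Q1, K1 = rng.standard_normal((n, d)), rng.standard_normal((m, d))
Q2, K2 = rng.standard_normal((n, d)), rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
out, w = tda_attention(Q1, K1, Q2, K2, V)
```

Because the threshold margin scales with $\log m$, almost every entry of each row falls below it at this context length, so the weight matrix `w` is overwhelmingly exact zeros while each row still sums to one.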