🤖 AI Summary
To address the high computational complexity (O(n²)) of Transformer self-attention, which impedes inference efficiency, this paper proposes Top-Theta sparsification. The method replaces conventional top-k pruning with adaptive dynamic thresholding, eliminating full-vector dependencies and costly global search. It further introduces a numerical compensation mechanism and V-cache row compression to enable zero-retraining, single-calibration adaptation across diverse data distributions. Experiments show that Top-Theta reduces V-cache memory by 3× during decoding and the number of attention elements computed by 10× during prefill, with negligible accuracy degradation (ΔAcc < 0.3%). The core contribution is a compensation-aware threshold-pruning mechanism that balances inference efficiency, cross-distribution generalizability, and accuracy robustness.
📝 Abstract
The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, so thresholds do not need to be recalibrated for different datasets. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding a costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
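To make the core idea concrete, here is a minimal NumPy sketch of threshold-based attention pruning for a single query row. This is an illustrative simplification, not the paper's implementation: the threshold `theta` stands in for the calibrated per-layer/per-head thresholds the paper describes, and the numerical compensation techniques are omitted. Note that each score is compared independently against `theta`, so no full-vector top-k search is needed.

```python
import numpy as np

def top_theta_attention(q, K, V, theta):
    """Attention for one query row, pruning scores below a calibrated threshold.

    q: (d,) query vector; K, V: (n, d) key/value matrices;
    theta: scalar threshold (assumed calibrated so at least one score survives).
    """
    # Scaled dot-product scores against all keys.
    scores = q @ K.T / np.sqrt(K.shape[1])
    # Element-wise threshold test: no global top-k search, no
    # full-vector dependency, so rows can be processed tile by tile.
    kept = np.where(scores >= theta, scores, -np.inf)
    # Softmax over the surviving elements only; pruned entries get weight 0,
    # so their V-cache rows never need to be read.
    w = np.exp(kept - kept.max())
    w /= w.sum()
    return w @ V
```

With `theta` set below all scores this reduces to ordinary softmax attention; raising `theta` drops low-scoring keys and, during decoding, the V-cache rows they index.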