Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational complexity (O(n²)) of Transformer self-attention, which impedes inference efficiency, this paper proposes Top-Theta sparsification. The method replaces conventional top-k pruning with calibrated adaptive thresholding, eliminating full-vector dependencies and the costly global top-k search. It further introduces a numerical compensation mechanism and V-cache row compression to enable zero-retraining, single-calibration adaptation across diverse data distributions. Experiments show that Top-Theta reduces V-cache memory by 3× during decoding and decreases attention computation elements by 10× during prefill, while preserving accuracy with negligible degradation (ΔAcc < 0.3%). The core contribution is a compensation-aware threshold pruning mechanism that balances inference efficiency, cross-distribution generalizability, and accuracy robustness.

📝 Abstract
The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
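The thresholding idea from the abstract can be illustrated with a minimal single-query sketch. This is an assumption-laden illustration, not the paper's implementation: the function name `top_theta_attention`, the single-query form, and the renormalization over surviving scores (a simple stand-in for the paper's numerical compensation techniques) are all hypothetical. The key property it shows is that the keep/drop test is elementwise against a precalibrated threshold `theta`, so no full-vector top-k search or sort is needed, and only the surviving V rows are read.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def top_theta_attention(q, K, V, theta):
    """Single-query attention that drops scores below a calibrated
    threshold theta (illustrative sketch, not the paper's exact code)."""
    scores = (K @ q) / np.sqrt(q.shape[-1])  # (n,) attention logits
    keep = scores >= theta                   # elementwise test: no top-k search,
                                             # no full-vector dependency
    probs = softmax(scores[keep])            # renormalize over kept elements only
    return probs @ V[keep]                   # only surviving V-cache rows are read
```

With `theta` set below every score this reduces exactly to dense attention; as `theta` rises, fewer V rows are touched, which is the source of the V-cache savings described above.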
Problem

Research questions and friction points this paper is trying to address.

Reduces computational load in transformers
Prunes less essential attention elements
Preserves model accuracy without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective pruning of attention elements
Efficient numerical compensation techniques
No model retraining required
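The "brief calibration phase" mentioned above can be sketched as follows. This is a hypothetical illustration of one plausible calibration scheme, not the paper's procedure: `calibrate_threshold`, the choice of averaging the k-th largest score over calibration rows, and the single scalar threshold per head are all assumptions. The point is that one offline pass over calibration data yields a fixed threshold, after which inference needs no per-query top-k search.

```python
import numpy as np

def calibrate_threshold(score_rows, k):
    """Choose a threshold so that, on calibration data, roughly the top-k
    scores per row survive (illustrative sketch of threshold calibration)."""
    # k-th largest score in each calibration row, found once, offline
    kth = np.partition(score_rows, -k, axis=-1)[:, -k]
    return float(kth.mean())  # a single scalar reused at inference time
```

At inference, comparing each new score against this scalar keeps, on average, about k elements per row, approximating top-k selection without its global search.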