How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse

📅 2024-04-03

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

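This paper studies how sparse attention approximates standard attention. Its theoretical framework shows that standard attention is naturally sparse, and it proposes an adaptive window-size strategy to improve efficiency and accuracy.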

📝 Abstract

Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during the softmax function computation. Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment. Despite its widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with traditional attention remains elusive. This work aims to $\textbf{bridge this gap by examining the inherent sparsity of standard attention processes}$. Our theoretical framework reveals several brand-new key insights:

$\bullet$ Attention is $n^{C}$-sparse, implying that considering only the largest $\Omega(n^{C})$ entries out of all $n$ entries is sufficient for sparse attention to approximate the exact attention matrix with decreasing loss. Here, $n$ represents the input length and $C \in (0, 1)$ is a constant.

$\bullet$ Stable $o(\log(n))$-sparse attention, which approximates attention computation with $\log(n)$ or fewer entries, may not be feasible since the error will persist at a minimum of $O(1)$.

$\bullet$ An adaptive strategy ($\alpha \cdot n^C, \alpha \in \mathbb{R}$) for the window size of efficient attention methods rather than a fixed one is guaranteed to perform more accurately and efficiently in a task for inference on flexible context lengths.
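The idea of keeping only the largest entries of one attention row can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `exact_attention_row` and `topk_sparse_attention_row`, the Gaussian test scores, and the choice $C = 0.5$ are all assumptions made for illustration.

```python
import numpy as np

def exact_attention_row(scores):
    """Standard softmax over all n pre-softmax scores of one query."""
    shifted = scores - scores.max()          # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def topk_sparse_attention_row(scores, k):
    """Softmax restricted to the k largest scores; every other entry is set to 0."""
    n = scores.shape[0]
    k = max(1, min(k, n))
    keep = np.argpartition(scores, n - k)[n - k:]   # indices of the k largest scores
    kept = np.exp(scores[keep] - scores[keep].max())
    sparse = np.zeros(n)
    sparse[keep] = kept / kept.sum()
    return sparse

# Hypothetical setup: random scores and C = 0.5 are illustrative choices only.
rng = np.random.default_rng(0)
n, C = 1024, 0.5
scores = rng.normal(size=n)
k = int(np.ceil(n ** C))                     # keep on the order of n^C entries

exact = exact_attention_row(scores)
sparse = topk_sparse_attention_row(scores, k)
l1_error = np.abs(exact - sparse).sum()
print(f"n={n}, k={k}, L1 error={l1_error:.4f}")
```

The printed error depends entirely on how the scores are distributed, so this toy run says nothing about the paper's bounds; it only shows the mechanics of dropping small entries before renormalizing.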

Problem

Research questions and friction points this paper is trying to address.

The conditions under which sparse attention performs on par with standard attention lack a theoretical explanation.

The inherent sparsity of standard attention has not been characterized.

Efficient attention methods need an adaptive, rather than fixed, window-size strategy.

Innovation

Methods, ideas, or system contributions that make the work stand out.

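Sparse attention reaches sub-quadratic complexity by ignoring the smaller entries of the attention matrix during the softmax computation.

An adaptive window size of $\alpha \cdot n^C$ is more accurate and efficient than a fixed one when context lengths vary.

Keeping only the largest $\Omega(n^{C})$ entries is sufficient to approximate the exact attention matrix with decreasing loss.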
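As a rough sketch of an adaptive window, the snippet below sizes a causal sliding-window mask as $\alpha \cdot n^C$ instead of using a fixed width. The helpers `adaptive_window` and `causal_window_mask` are hypothetical, and the values $\alpha = 2$ and $C = 0.5$ are arbitrary illustrative choices, not values taken from the paper.

```python
import numpy as np

def adaptive_window(n, alpha=2.0, C=0.5):
    """Window size alpha * n^C, rounded up and clamped to [1, n] (illustrative defaults)."""
    return max(1, min(n, int(np.ceil(alpha * n ** C))))

def causal_window_mask(n, window):
    """Boolean (n, n) mask: query i may attend to keys j with i - window < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

for n in (64, 1024):
    w = adaptive_window(n)
    mask = causal_window_mask(n, w)
    kept_fraction = mask.sum() / (n * n)
    print(f"n={n}: window={w}, fraction of entries kept={kept_fraction:.3f}")
```

With this rule the window grows with the context length while the fraction of attention entries that are computed shrinks, which is the trade-off a fixed window cannot make.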

🔎 Similar Papers

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency