AI Summary
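This work studies how sparse attention approximates standard attention. Its theoretical framework shows that standard attention is naturally sparse, and it proposes an adaptive window strategy to improve efficiency and accuracy.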
Abstract
Sparse attention is a technique that approximates standard attention computation with sub-quadratic complexity. It does so by selectively ignoring the smaller entries of the attention matrix during the softmax computation. Variations of this technique, such as KV-cache pruning, sparsity-based fast attention, and Sparse Transformer, have been used extensively for efficient deployment of Large Language Models (LLMs). Despite this widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with traditional attention remains elusive. This work aims to $\textbf{bridge this gap by examining the inherent sparsity of standard attention processes}$. Our theoretical framework reveals several new key insights: $\bullet$ Attention is $n^{C}$-sparse, implying that considering only the largest $\Omega(n^{C})$ entries out of all $n$ entries is sufficient for sparse attention to approximate the exact attention matrix with decreasing loss. Here, $n$ represents the input length and $C \in (0, 1)$ is a constant. $\bullet$ Stable $o(\log(n))$-sparse attention, which approximates attention computation with $\log(n)$ or fewer entries, may not be feasible, since the error will persist at a minimum of $O(1)$. $\bullet$ An adaptive strategy ($\alpha \cdot n^{C}, \alpha \in \mathbb{R}$) for the window size of efficient attention methods, rather than a fixed one, is guaranteed to perform more accurately and efficiently in a task for inference on flexible context lengths.
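The adaptive window idea can be illustrated with a minimal sketch, assuming a simple per-query top-$k$ selection with $k = \lceil \alpha \cdot n^{C} \rceil$. The function names (`adaptive_k`, `attention_row`, `attention`, `max_abs_error`) and the default values of `alpha` and `c` are hypothetical choices for illustration, not taken from the paper. The sketch still computes all $n$ scores before selecting, so it shows only the approximation step, not a sub-quadratic algorithm.

```python
import math
import random


def adaptive_k(n, alpha=1.0, c=0.5):
    """Window size alpha * n^C, rounded up and clamped to [1, n] (illustrative defaults)."""
    return max(1, min(n, math.ceil(alpha * n ** c)))


def attention_row(scores, values, num_keep=None):
    """Softmax-weighted average of `values` for one query.

    If num_keep is given, only the num_keep largest scores enter the
    softmax; every other entry is treated as having weight zero.
    """
    n = len(scores)
    if num_keep is None:
        kept = range(n)
    else:
        kept = sorted(range(n), key=lambda j: scores[j], reverse=True)[:num_keep]
    top = max(scores[j] for j in kept)  # subtract the max for numerical stability
    weights = {j: math.exp(scores[j] - top) for j in kept}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(w * values[j][t] for j, w in weights.items()) / total
            for t in range(dim)]


def attention(q, k, v, num_keep=None):
    """Scaled dot-product attention over lists of vectors; sparse if num_keep is set."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        out.append(attention_row(scores, v, num_keep))
    return out


def max_abs_error(a, b):
    """Largest entrywise difference between two equally shaped outputs."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Usage: Gaussian inputs, exact attention versus an adaptive top-k window.
random.seed(0)
n, d = 64, 8
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
window = adaptive_k(n, alpha=2.0, c=0.5)
error = max_abs_error(attention(q, k, v), attention(q, k, v, num_keep=window))
print(f"n={n}, window={window}, max abs error={error:.4f}")
```

Rerunning the usage lines with different values of `n` while holding `alpha` and `c` fixed gives a rough empirical feel for how the window grows with context length, in contrast to a fixed window size.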