🤖 AI Summary
This work addresses the high computational cost of dense attention during the prefilling phase of large language model inference. The authors propose a training-free, hardware-agnostic sparse attention algorithm that, for the first time, leverages the observation that queries with low cosine similarity to the mean query dominate attention outcomes. By employing a query-guided strategy that selects the most representative queries and their best-aligned key-value pairs, the method efficiently approximates full attention. Evaluated across multiple benchmarks, the approach achieves accuracy nearly on par with the dense baseline while using 88% fewer key-value pairs. It delivers a 5× attention speedup on NVIDIA GPUs and nearly 7× on Intel Xeon CPUs, along with a 3× reduction in first-token latency.
📝 Abstract
We present QUOKA (Query-Oriented KV selection for efficient Attention), a training-free, hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While most queries attend to only a small subset of keys, we observe that queries with low cosine similarity to the mean query interact more strongly with more keys and contribute most to the final attention logits. By prioritizing these low-similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that QUOKA achieves near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation, realizing a 3× reduction in time-to-first-token, a 5× attention speedup on NVIDIA GPUs, and up to nearly a 7× speedup on Intel Xeon CPUs.
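The two-stage selection described above can be sketched in NumPy. This is not the authors' implementation: the function name `quoka_select`, the keep ratios, and the use of a per-key maximum over representative queries as the alignment score are all illustrative assumptions; only the overall idea (keep queries least similar to the mean query, then keep the keys best aligned with them) comes from the abstract.

```python
# Hedged sketch of query-oriented KV selection, per the abstract.
# All names and ratios are illustrative, not the authors' code.
import numpy as np

def quoka_select(Q, K, query_keep=0.25, kv_keep=0.12):
    """Pick representative queries and their best-aligned keys.

    Q: (n_q, d) query vectors; K: (n_k, d) key vectors.
    Returns (representative query indices, retained key indices).
    """
    # (1) Cosine similarity of each query to the mean query; queries
    #     with the LOWEST similarity are kept as representatives.
    q_norm = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    q_mean = Q.mean(axis=0)
    cos_to_mean = q_norm @ (q_mean / np.linalg.norm(q_mean))
    n_keep_q = max(1, int(query_keep * len(Q)))
    rep_q = np.argsort(cos_to_mean)[:n_keep_q]

    # (2) Score each key by its best alignment with any representative
    #     query; keep only the top fraction (e.g. 12% => 88% fewer KV).
    k_norm = K / np.linalg.norm(K, axis=1, keepdims=True)
    scores = (q_norm[rep_q] @ k_norm.T).max(axis=0)
    n_keep_k = max(1, int(kv_keep * len(K)))
    top_k = np.argsort(scores)[-n_keep_k:]
    return rep_q, np.sort(top_k)
```

Attention would then be evaluated only over the retained key-value pairs, approximating the full prefill attention at a fraction of the cost.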