🤖 AI Summary
This work addresses the problem of approximating the output of the attention mechanism in small space while guaranteeing uniform approximation accuracy for all queries of bounded norm. Leveraging tools from high-dimensional geometry and probabilistic analysis, the authors show that sparse subsets of the key-value pairs, termed ε-coresets, exist and satisfy a uniform error bound. The main contribution is establishing the existence of an ε-coreset of size O(√d·e^{ρ+o(ρ)}/ε) for unit-norm keys and values and queries of norm at most ρ, together with a lower bound of Ω(√d·e^ρ/ε). The upper bound improves on the best previously known results, and the two bounds match up to the e^{o(ρ)} factor, so the optimal size of attention coresets is characterized nearly tightly.
📝 Abstract
We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega(\sqrt{d}\, e^{\rho}/\varepsilon)$.
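To make the guarantee concrete, the following is a minimal sketch of how the coreset error could be probed numerically. It assumes that $\operatorname{Attn}$ is standard softmax attention with unscaled inner-product scores and that the norm is Euclidean; the abstract states neither. The names `attn`, `random_unit_vector`, and `sampled_coreset_error` are hypothetical helpers introduced here for illustration. The sketch does not construct a coreset and is not the authors' method. It only evaluates a given index subset on randomly sampled queries, so the value it returns is a lower estimate of the uniform error over all queries with norm at most $\rho$.

```python
import math
import random


def attn(q, K, V):
    """Softmax attention for one query (assumed definition):
    sum_i softmax(<q, k_i>)_i * v_i."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_out = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(d_out)]


def random_unit_vector(d, rng):
    """Draw a uniformly random direction in R^d by normalising a Gaussian."""
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]


def sampled_coreset_error(K, V, idx, rho, n_queries=200, seed=0):
    """Largest error observed over sampled queries with norm at most rho.

    This is only a lower estimate of the worst case over the whole ball,
    because finitely many random queries cannot certify a uniform bound.
    """
    rng = random.Random(seed)
    d = len(K[0])
    K_sub = [K[i] for i in idx]
    V_sub = [V[i] for i in idx]
    worst = 0.0
    for _ in range(n_queries):
        # radius ~ rho * U^(1/d) gives a point uniform in the ball of radius rho
        radius = rho * rng.random() ** (1.0 / d)
        q = [radius * t for t in random_unit_vector(d, rng)]
        err = math.dist(attn(q, K, V), attn(q, K_sub, V_sub))
        worst = max(worst, err)
    return worst
```

For example, one could draw unit-norm keys and values with `random_unit_vector`, choose a candidate subset `idx`, and compare `sampled_coreset_error(K, V, idx, rho)` against a target $\varepsilon$. Because each attention output is a convex combination of unit-norm values, the error under these assumptions can never exceed 2, which gives a quick sanity check.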