🤖 AI Summary
This work addresses the efficient computation of Gaussian kernel matrix–vector products for asymmetric kernel matrices. We propose a subquadratic-time algorithm that, under a sparsity assumption (namely, that the sum of the entries of the kernel matrix grows linearly with input size rather than quadratically), achieves time complexity $O(n^{2-\alpha} d)$ for some $\alpha > 0$, linear space complexity, and a provable $L_2$-norm error guarantee for unrestricted input vectors. This is the first such result for general Gaussian kernels, breaking the $O(n^2 d)$ barrier inherent in naive attention computation. The method directly accelerates the core attention mechanism in large language models (LLMs). Empirical evaluation validates the sparsity assumption on real LLM attention matrices and demonstrates superior accuracy–efficiency trade-offs compared to baseline approaches.
📝 Abstract
Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian kernel matrices $K \in \mathbb{R}^{n \times n}$. The columns of $K$ are indexed by a set of $n$ keys $k_1, k_2, \ldots, k_n \in \mathbb{R}^d$, its rows by a set of $n$ queries $q_1, q_2, \ldots, q_n \in \mathbb{R}^d$, and its $(i,j)$ entry is $K_{ij} = e^{-\|q_i - k_j\|_2^2 / (2\sigma^2)}$ for some bandwidth parameter $\sigma > 0$. Given a vector $x \in \mathbb{R}^n$ and an error parameter $\varepsilon > 0$, the task is to output a $y \in \mathbb{R}^n$ such that $\|Kx - y\|_2 \leq \varepsilon \|x\|_2$ in time subquadratic in $n$ and linear in $d$. Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$, as opposed to worst-case quadratic growth. We validate this assumption experimentally for Gaussian kernel matrices encountered in various settings, such as fast attention computation in LLMs. We obtain the first subquadratic-time algorithm that works under this assumption for unrestricted vectors.
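For concreteness, the $O(n^2 d)$ baseline that the paper's subquadratic algorithm improves on can be sketched as follows. This is a minimal NumPy illustration of the problem statement only, not the paper's method; the function name and interface are hypothetical.

```python
import numpy as np

def gaussian_kernel_matvec(Q, Kmat, x, sigma=1.0):
    """Naive O(n^2 d) computation of y = Kx, where
    K_ij = exp(-||q_i - k_j||_2^2 / (2 sigma^2)).

    Q    : (n, d) array of query vectors q_1, ..., q_n.
    Kmat : (n, d) array of key vectors k_1, ..., k_n.
    x    : (n,) input vector.
    """
    # Pairwise squared distances via the expansion
    # ||q - k||^2 = ||q||^2 - 2 <q, k> + ||k||^2.
    sq = (
        (Q ** 2).sum(axis=1)[:, None]
        - 2.0 * Q @ Kmat.T
        + (Kmat ** 2).sum(axis=1)[None, :]
    )
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from round-off
    kernel = np.exp(-sq / (2.0 * sigma ** 2))  # dense n x n Gaussian kernel matrix
    return kernel @ x
```

Forming the dense kernel explicitly costs $\Theta(n^2)$ time and space, which is exactly the barrier the paper's sparsity assumption is used to break.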