AI Summary
The quadratic time complexity of Softmax attention severely hinders the scalability of large language models (LLMs). Existing subquadratic approximation methods rely on unrealistic bounded-input assumptions, limiting their applicability to real-world LLMs. This paper introduces the Support Basis Decomposition framework, the first distribution-free approach enabling multi-threshold subquadratic attention. It partitions query and key matrices by magnitude into sparse (large-magnitude) and dense (small-magnitude) components: the former is computed exactly, while the latter is approximated via low-degree polynomials, accelerated by sketching techniques. We theoretically establish its subquadratic time complexity and further prove that real-world queries and keys satisfy sub-Gaussianity, a foundational property that provides the first theoretical justification for the empirical success of polynomial-based attention approximations. Experiments demonstrate substantial improvements in computational efficiency for large-scale attention.
Abstract
The quadratic complexity of softmax attention remains a central bottleneck in scaling large language models (LLMs). [Alman and Song, NeurIPS 2023] proposed a sub-quadratic attention approximation algorithm, but it works only under the restrictive bounded-entry assumption. Since this assumption rarely holds in practice, its applicability to modern LLMs is limited.
In this paper, we introduce support-basis decomposition, a new framework for efficient attention approximation beyond bounded entries. We empirically demonstrate that the entries of the query and key matrices exhibit sub-Gaussian behavior. Our approach uses this property to split large and small entries, enabling exact computation on sparse components and polynomial approximation on dense components. We establish rigorous theoretical guarantees, proving a sub-quadratic runtime, and extend the method to a multi-threshold setting that eliminates all distributional assumptions. Furthermore, we provide the first theoretical justification for the empirical success of polynomial attention [Kacham, Mirrokni, and Zhong, ICML 2024], showing that softmax attention can be closely approximated by a combination of multiple polynomial attentions with sketching.
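The decomposition described above can be sketched in a few lines. This is an illustrative toy version only, not the paper's actual algorithm: the threshold `tau`, the Taylor degree, and the function name are placeholders, and a faithful subquadratic implementation would use sparse matrix products and sketching rather than the dense operations below. The sketch shows the core identity: splitting Q and K by entry magnitude splits the score matrix additively into a part touched by large entries (computed exactly) and a bounded part (where a low-degree polynomial approximates exp).

```python
import math
import numpy as np

def support_basis_attention(Q, K, V, tau=1.0, degree=8):
    """Toy sketch of support-basis decomposition (dense, quadratic-time;
    for illustration only). Large entries of Q, K go into a 'sparse'
    component handled exactly; the bounded remainder is handled with a
    truncated Taylor polynomial for exp."""
    d = Q.shape[1]
    # Split each matrix by entry magnitude.
    Qs = np.where(np.abs(Q) > tau, Q, 0.0)  # large ("sparse") entries
    Qd = Q - Qs                             # small ("dense") entries
    Ks = np.where(np.abs(K) > tau, K, 0.0)
    Kd = K - Ks
    # Q K^T = S + D, where S involves at least one large-entry factor
    # and D = Qd Kd^T has bounded entries.
    S = (Qs @ K.T + Qd @ Ks.T) / np.sqrt(d)  # exact part
    D = (Qd @ Kd.T) / np.sqrt(d)             # bounded part
    # Elementwise: exp(S + D) = exp(S) * exp(D) ~= exp(S) * poly(D),
    # where poly is the degree-`degree` Taylor expansion of exp.
    # (Even-degree truncations of exp are positive for all real inputs.)
    poly = sum(D**k / math.factorial(k) for k in range(degree + 1))
    A = np.exp(S) * poly
    A /= A.sum(axis=1, keepdims=True)  # softmax normalization
    return A @ V
```

Because the split is exact (`S + D = QK^T`), the only approximation error comes from the polynomial on the bounded component `D`, which is exactly where low-degree Taylor truncation of exp is accurate.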