🤖 AI Summary
Standard attention mechanisms incur quadratic compute and memory cost in sequence length because they model all pairwise interactions. Linear-time alternatives, such as sparsity-based approximations or state space models (SSMs), reduce this cost but sacrifice either expressive capacity or flexibility in modeling long-range dependencies. This paper introduces TreeAttention, the first attention mechanism to compute attention via efficient inversion of tree-structured matrices, combining sparsity with recurrent dependency modeling. Using structured linear transformations and hierarchical tree-recursive computation, TreeAttention achieves near-exact inversion for sequence-to-sequence mappings, preserving strong representational power while reducing complexity to nearly linear. Extensive experiments show that TreeAttention significantly outperforms standard attention and leading linear-time methods on long-sequence tasks, striking a better balance between efficiency and modeling capability.
📝 Abstract
Attention layers apply a sequence-to-sequence mapping whose parameters depend on the pairwise interactions of the input elements. However, without any structural assumptions, memory and compute scale quadratically with the sequence length. The two main ways to mitigate this are to introduce sparsity by ignoring a sufficient number of pairwise interactions, or to introduce recurrent dependence along them, as state space models (SSMs) do. Both approaches are reasonable, but each has disadvantages. We propose a novel algorithm that combines the advantages of both. Our idea is based on the efficient inversion of tree-structured matrices.
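To give intuition for why tree structure enables near-linear cost (the paper's actual algorithm is not shown here), consider the simplest tree, a path: its sparsity pattern yields a lower-bidiagonal matrix, and solving `L x = b` then takes O(n) by forward substitution instead of the O(n³) of a dense inverse. The sketch below, with a hypothetical `solve_bidiagonal` helper, is purely illustrative of this structured-inversion idea.

```python
# Illustrative sketch, not the paper's method: a path is the simplest
# tree-structured sparsity pattern, giving a lower-bidiagonal matrix L.
# Applying L^{-1} to a vector costs O(n) via forward substitution.
import numpy as np

def solve_bidiagonal(diag, sub, b):
    """Solve L x = b where L has main diagonal `diag` and subdiagonal `sub`."""
    n = len(diag)
    x = np.empty(n)
    x[0] = b[0] / diag[0]
    for i in range(1, n):
        # Each unknown depends only on its tree parent (the previous index),
        # so one linear pass recovers the whole solution.
        x[i] = (b[i] - sub[i - 1] * x[i - 1]) / diag[i]
    return x

rng = np.random.default_rng(0)
n = 6
diag = rng.uniform(1.0, 2.0, size=n)       # nonzero diagonal keeps L invertible
sub = rng.uniform(-0.5, 0.5, size=n - 1)
b = rng.normal(size=n)

L = np.diag(diag) + np.diag(sub, k=-1)     # dense copy, for verification only
x = solve_bidiagonal(diag, sub, b)
assert np.allclose(L @ x, b)               # matches the dense solve
```

For a general tree, the same idea applies: eliminating variables leaf-to-root and back-substituting root-to-leaf solves the system in time linear in the number of edges, which is the structural fact the abstract's "efficient inversion of tree-structured matrices" relies on.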