You Need Better Attention Priors

📅 2026-01-21
📈 Citations: 1
Influential: 0

🤖 AI Summary
Standard attention implicitly assumes a fixed uniform prior over keys, which limits representational capacity and length generalization and is linked to attention sinks. This work reframes attention as an entropy-regularized optimal transport problem and replaces the uniform prior with a learnable, continuous one. The resulting mechanism, GOAT (Generalized Optimal transport Attention with Trainable priors), remains compatible with optimized kernels such as FlashAttention. It gives an optimal-transport explanation of attention sinks together with a remedy for them, and it learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
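
The claim that standard attention hides a uniform prior can be made concrete with a short derivation. The following is a sketch of the textbook single-row case and not necessarily the paper's exact formulation: the symbols $s_j$ (scaled query-key score), $\pi$ (prior over keys), and $\varepsilon$ (regularization strength) are notation assumed here for illustration.

```latex
% One query row; s_j = q^\top k_j / \sqrt{d}; \pi is a prior over the n keys.
p^\star = \arg\max_{p \in \Delta^{n-1}} \; \sum_{j} p_j s_j \;-\; \varepsilon\, \mathrm{KL}(p \,\|\, \pi)
\quad\Longrightarrow\quad
p^\star_j = \frac{\pi_j \exp(s_j/\varepsilon)}{\sum_{l} \pi_l \exp(s_l/\varepsilon)}
          = \operatorname{softmax}\!\big(s/\varepsilon + \log \pi\big)_j .
% Uniform prior \pi_j = 1/n with \varepsilon = 1: \log \pi is a constant shift,
% which softmax ignores, so p^\star = \operatorname{softmax}(s).
```

Under these assumptions, a non-uniform prior enters only as an additive $\log \pi$ term on the scores, which is one plausible reason such a prior can coexist with fused attention kernels.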

📝 Abstract
We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
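
To illustrate how a prior over keys can be folded into the attention computation, here is a minimal NumPy sketch. It assumes the prior is supplied as a log-prior matrix added to the scaled scores before the softmax. The function names `attention_with_prior` and `distance_log_prior`, and the distance-decay form of the prior, are hypothetical choices for illustration and are not GOAT's actual parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prior(q, k, v, log_prior=None):
    """Single-head attention with an optional additive log-prior over keys.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v), log_prior: (n_q, n_k) or None.
    With log_prior=None this is ordinary scaled dot-product attention.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if log_prior is not None:
        scores = scores + log_prior  # prior enters as an additive bias
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def distance_log_prior(n_q, n_k, slope):
    # Hypothetical prior: log-probability decays linearly with |i - j|.
    # It is defined for any sequence length, so it can be evaluated beyond
    # the lengths seen when 'slope' was chosen.
    i = np.arange(n_q)[:, None]
    j = np.arange(n_k)[None, :]
    return -slope * np.abs(i - j)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))

# Standard attention and attention with an explicit uniform prior coincide,
# because a constant log-prior shifts every score in a row equally.
out_std, w_std = attention_with_prior(q, k, v)
uniform_log_prior = np.full((4, 6), -np.log(6))
out_uni, w_uni = attention_with_prior(q, k, v, uniform_log_prior)

# A non-uniform prior reweights the standard attention distribution.
log_prior = distance_log_prior(4, 6, slope=1.0)
out_loc, w_loc = attention_with_prior(q, k, v, log_prior)
```

In a trained model the slope (or a richer function of relative position) would be a learned parameter. Because the prior only adds a bias to the scores, a kernel that accepts an additive attention bias could in principle consume it without changing the softmax itself.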
Problem

Research questions and friction points this paper is trying to address.

attention mechanism
attention sinks
positional encoding
length generalization
uniform prior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal Transport
Trainable Prior
Attention Mechanism
Length Generalization
FlashAttention