AI Summary
Standard softmax attention in Transformers exhibits quadratic computational complexity and suffers from uneven token participation due to row-wise normalization, impairing robustness and information flow. While doubly stochastic attention alleviates these issues, existing approaches incur high computational overhead and lack scalability. This paper proposes LOTFormer, a novel attention mechanism achieving both linear complexity and strict double stochasticity. Its core is an entropy-regularized optimal transport framework with low-rank constraints, where attention is modeled as a two-stage coupling decomposition: query → pivot → key. Leveraging learnable, compactly supported pivot measures, LOTFormer computes doubly stochastic attention mappings in O(nr) time, where r ≪ n is the effective rank. To our knowledge, it is the first method guaranteeing strict double stochasticity while maintaining linear-time efficiency. On the Long Range Arena benchmark, LOTFormer achieves state-of-the-art performance, significantly outperforming existing linear-time and optimal-transport-based attention models.
Abstract
Transformers have proven highly effective across a wide range of modalities. However, the quadratic complexity of the standard softmax attention mechanism poses a fundamental barrier to scaling them to long context windows. A large body of work addresses this with linear attention, which reformulates attention as a kernel function and approximates it with finite feature maps to achieve linear-time computation. Orthogonal to computational scaling, most attention mechanisms, both quadratic and linear, produce row-normalized maps that can over-focus on a few tokens, degrading robustness and information flow. Enforcing doubly-stochastic attention alleviates this by balancing token participation across rows and columns, but existing doubly-stochastic attention mechanisms typically introduce substantial overhead, undermining scalability. We propose LOTFormer, a principled attention mechanism that is simultaneously linear-time and doubly-stochastic. Our approach exploits the connection between attention maps and transportation plans between query and key measures. The central idea is to constrain the transport plan to be low-rank by conditioning it on a learnable pivot measure with small support. Concretely, we solve two entropic optimal transport problems (queries $\to$ pivot and pivot $\to$ keys) and compose them into a conditional (glued) coupling. This yields an attention matrix that is provably doubly-stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ map. The pivot locations and masses are learned end-to-end. Empirically, LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark, surpassing prior linear and transport-based attention methods in both accuracy and efficiency.
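The two-stage coupling construction described above can be sketched numerically. The snippet below is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it assumes squared-Euclidean transport costs (normalized by their mean for stability), uniform query/key masses, and a plain Sinkhorn solver; the names `sinkhorn` and `lot_attention` are hypothetical. It glues the two entropic couplings through the pivot measure and applies the result to the values without ever materializing the $n \times n$ attention map.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iters=300):
    """Entropic optimal transport via Sinkhorn iterations.

    Returns a coupling P whose row sums equal a (exact, since the last
    scaling update is on u) and whose column sums approximate b.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # match column marginals
        u = a / (K @ v)                   # match row marginals
    return u[:, None] * K * v[None, :]

def lot_attention(Q, K, V, pivots, mu, eps=0.5):
    """Low-rank doubly stochastic attention through r pivot points.

    Q, K, V: (n, d) arrays; pivots: (r, d) learnable locations;
    mu: (r,) positive pivot masses summing to 1.
    Runs in O(n r d): the full n x n map is never formed.
    """
    n = Q.shape[0]
    a = np.full(n, 1.0 / n)               # uniform token masses
    # Squared-Euclidean costs (an assumption), mean-normalized.
    C1 = ((Q[:, None, :] - pivots[None, :, :]) ** 2).sum(-1)  # (n, r)
    C2 = ((pivots[:, None, :] - K[None, :, :]) ** 2).sum(-1)  # (r, n)
    P1 = sinkhorn(C1 / C1.mean(), a, mu, eps)   # queries -> pivots
    P2 = sinkhorn(C2 / C2.mean(), mu, a, eps)   # pivots  -> keys
    # Glued coupling P = P1 diag(1/mu) P2 has uniform marginals 1/n,
    # so A = n * P is doubly stochastic with rank <= r. Apply to V:
    return n * (P1 @ ((P2 @ V) / mu[:, None]))
```

In this sketch the row marginals of the composed map $A = n\,P_1\,\mathrm{diag}(1/\mu)\,P_2$ are exactly 1 by construction (each Sinkhorn loop ends with a row update), while the column marginals approach 1 at the Sinkhorn convergence tolerance.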