LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

📅 2025-09-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Standard softmax attention in Transformers exhibits quadratic computational complexity and suffers from uneven token participation due to row-wise normalization, impairing robustness and information flow. While doubly stochastic attention alleviates these issues, existing approaches incur high computational overhead and lack scalability. This paper proposes LOTFormer, a novel attention mechanism achieving both linear complexity and strict double stochasticity. Its core is an entropy-regularized optimal transport framework with low-rank constraints, where attention is modeled as a two-stage coupling decomposition: query → pivot → key. Leveraging learnable, compactly supported pivot measures, LOTFormer computes doubly stochastic attention mappings in O(nr) time, where r ≪ n is the effective rank. To our knowledge, it is the first method guaranteeing strict double stochasticity while maintaining linear-time efficiency. On the Long Range Arena benchmark, LOTFormer achieves state-of-the-art performance, significantly outperforming existing linear-time and optimal-transport-based attention models.
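The two-stage coupling admits a short marginal check. Writing $a$ and $b$ for the query and key marginals and $\mu$ for the pivot measure (notation assumed here for illustration, not taken from the paper), the glued plan and its marginals are:

```latex
% P_1 \in \mathbb{R}^{n \times r}: entropic OT plan, queries -> pivots
%   (P_1 \mathbf{1} = a, \quad P_1^\top \mathbf{1} = \mu)
% P_2 \in \mathbb{R}^{r \times n}: entropic OT plan, pivots -> keys
%   (P_2 \mathbf{1} = \mu, \quad P_2^\top \mathbf{1} = b)
P = P_1 \,\operatorname{diag}(\mu)^{-1} P_2
% Row marginals:
P \mathbf{1} = P_1 \operatorname{diag}(\mu)^{-1} (P_2 \mathbf{1})
             = P_1 \operatorname{diag}(\mu)^{-1} \mu = P_1 \mathbf{1} = a
% Column marginals:
P^\top \mathbf{1} = P_2^\top \operatorname{diag}(\mu)^{-1} (P_1^\top \mathbf{1})
                  = P_2^\top \operatorname{diag}(\mu)^{-1} \mu = P_2^\top \mathbf{1} = b
```

Because $P$ factors through an $r$-dimensional intermediate, $\operatorname{rank}(P) \le r$, and $PV = P_1(\operatorname{diag}(\mu)^{-1}(P_2 V))$ costs $O(nr)$ without ever materializing $P$.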

πŸ“ Abstract
Transformers have proven highly effective across a wide range of modalities. However, the quadratic complexity of the standard softmax attention mechanism poses a fundamental barrier to scaling them to long context windows. A large body of work addresses this with linear attention, which reformulates attention as a kernel function and approximates it with finite feature maps to achieve linear-time computation. Orthogonal to computational scaling, most attention mechanisms -- both quadratic and linear -- produce row-normalized maps that can over-focus on a few tokens, degrading robustness and information flow. Enforcing doubly-stochastic attention alleviates this by balancing token participation across rows and columns, but existing doubly-stochastic attention mechanisms typically introduce substantial overhead, undermining scalability. We propose LOTFormer, a principled attention mechanism that is simultaneously linear-time and doubly-stochastic. Our approach exploits the connection between attention maps and transportation plans between query and key measures. The central idea is to constrain the transport plan to be low-rank by conditioning it on a learnable pivot measure with small support. Concretely, we solve two entropic optimal transport problems (queries $\to$ pivot and pivot $\to$ keys) and compose them into a conditional (glued) coupling. This yields an attention matrix that is provably doubly-stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ map. The pivot locations and masses are learned end-to-end. Empirically, LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark, surpassing prior linear and transport-based attention methods in both accuracy and efficiency.
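The construction described in the abstract can be sketched numerically. Below is a minimal NumPy illustration, not the authors' implementation: it assumes uniform query/key marginals, squared-Euclidean transport costs, a plain dense Sinkhorn solver, and hypothetical function names. A practical version would use a log-domain Sinkhorn for small regularization and learn the pivot locations and masses end-to-end.

```python
import numpy as np

def sinkhorn(C, a, b, eps=1.0, n_iter=200):
    """Entropic OT: returns a coupling P with row sums a and column sums b.

    Plain dense scaling iterations; a log-domain variant is needed for
    small eps. The final u-update makes the row marginal exact."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def lot_attention(Q, Keys, V, pivots, pivot_mass, eps=1.0):
    """Low-rank doubly-stochastic attention via a glued OT coupling.

    Solves queries -> pivots and pivots -> keys entropic OT, composes
    P = P1 diag(mu)^{-1} P2 without materializing it, and applies the
    result to V in O(n r) time. Scaling by n makes each attention row
    sum to 1."""
    n = Q.shape[0]
    a = np.full(n, 1.0 / n)   # uniform query marginal (assumption)
    b = np.full(n, 1.0 / n)   # uniform key marginal (assumption)
    # Squared-Euclidean cost matrices (assumption; paper may differ)
    C1 = ((Q[:, None, :] - pivots[None, :, :]) ** 2).sum(-1)     # n x r
    C2 = ((pivots[:, None, :] - Keys[None, :, :]) ** 2).sum(-1)  # r x n
    P1 = sinkhorn(C1, a, pivot_mass, eps)   # queries -> pivots
    P2 = sinkhorn(C2, pivot_mass, b, eps)   # pivots  -> keys
    # Compose through the pivots: only n x r and r x n matmuls
    return n * (P1 / pivot_mass[None, :]) @ (P2 @ V)
```

Reconstructing the full map $A = n\,P_1\,\mathrm{diag}(\mu)^{-1} P_2$ on a small example confirms that its rows and columns both sum to one, while the forward pass above never forms that $n \times n$ matrix.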
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity of attention for long sequences
Addresses over-focusing in attention via doubly-stochastic constraints
Achieves linear-time computation with low-rank optimal transport
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank optimal transport enables linear-time attention
Doubly-stochastic attention balances token participation
Learned pivot measure creates conditional glued coupling