AI Summary
Standard softmax attention in Transformers exhibits quadratic computational complexity and suffers from uneven token participation due to row-wise normalization, impairing robustness and information flow. While doubly stochastic attention alleviates these issues, existing approaches incur high computational overhead and lack scalability. This paper proposes LOTFormer, a novel attention mechanism achieving both linear complexity and strict double stochasticity. Its core is an entropy-regularized optimal transport framework with low-rank constraints, where attention is modeled as a two-stage coupling decomposition: query → pivot → key. Leveraging learnable, compactly supported pivot measures, LOTFormer computes doubly stochastic attention mappings in O(nr) time, where r ≪ n is the effective rank. To our knowledge, it is the first method guaranteeing strict double stochasticity while maintaining linear-time efficiency. On the Long Range Arena benchmark, LOTFormer achieves state-of-the-art performance, significantly outperforming existing linear-time and optimal-transport-based attention models.
Abstract
Transformers have proven highly effective across a wide range of modalities. However, the quadratic complexity of the standard softmax attention mechanism poses a fundamental barrier to scaling them to long context windows. A large body of work addresses this with linear attention, which reformulates attention as a kernel function and approximates it with finite feature maps to achieve linear-time computation. Orthogonal to computational scaling, most attention mechanisms, both quadratic and linear, produce row-normalized maps that can over-focus on a few tokens, degrading robustness and information flow. Enforcing doubly-stochastic attention alleviates this by balancing token participation across rows and columns, but existing doubly-stochastic attention mechanisms typically introduce substantial overhead, undermining scalability. We propose LOTFormer, a principled attention mechanism that is simultaneously linear-time and doubly-stochastic. Our approach exploits the connection between attention maps and transportation plans between query and key measures. The central idea is to constrain the transport plan to be low-rank by conditioning it on a learnable pivot measure with small support. Concretely, we solve two entropic optimal transport problems (queries $\to$ pivot and pivot $\to$ keys) and compose them into a conditional (glued) coupling. This yields an attention matrix that is provably doubly-stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ map. The pivot locations and masses are learned end-to-end. Empirically, LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark, surpassing prior linear and transport-based attention methods in both accuracy and efficiency.
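The two-stage coupling construction described above can be sketched numerically. The snippet below is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it assumes squared-Euclidean transport costs (normalized by their mean for stability), uniform query/key masses, and a plain Sinkhorn solver; the names `sinkhorn` and `lot_attention` are hypothetical. It glues the two entropic couplings through the pivot measure and applies the result to the values without ever materializing the $n \times n$ attention map.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iters=300):
    """Entropic optimal transport via Sinkhorn iterations.

    Returns a coupling P whose row sums equal a (exact, since the last
    scaling update is on u) and whose column sums approximate b.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # match column marginals
        u = a / (K @ v)                   # match row marginals
    return u[:, None] * K * v[None, :]

def lot_attention(Q, K, V, pivots, mu, eps=0.5):
    """Low-rank doubly stochastic attention through r pivot points.

    Q, K, V: (n, d) arrays; pivots: (r, d) learnable locations;
    mu: (r,) positive pivot masses summing to 1.
    Runs in O(n r d): the full n x n map is never formed.
    """
    n = Q.shape[0]
    a = np.full(n, 1.0 / n)               # uniform token masses
    # Squared-Euclidean costs (an assumption), mean-normalized.
    C1 = ((Q[:, None, :] - pivots[None, :, :]) ** 2).sum(-1)  # (n, r)
    C2 = ((pivots[:, None, :] - K[None, :, :]) ** 2).sum(-1)  # (r, n)
    P1 = sinkhorn(C1 / C1.mean(), a, mu, eps)   # queries -> pivots
    P2 = sinkhorn(C2 / C2.mean(), mu, a, eps)   # pivots  -> keys
    # Glued coupling P = P1 diag(1/mu) P2 has uniform marginals 1/n,
    # so A = n * P is doubly stochastic with rank <= r. Apply to V:
    return n * (P1 @ ((P2 @ V) / mu[:, None]))
```

In this sketch the row marginals of the composed map $A = n\,P_1\,\mathrm{diag}(1/\mu)\,P_2$ are exactly 1 by construction (each Sinkhorn loop ends with a row update), while the column marginals approach 1 at the Sinkhorn convergence tolerance.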