🤖 AI Summary
Self-attention often suffers from imbalanced information flow among tokens, and existing doubly-stochastic remedies rely on computationally expensive iterative Sinkhorn normalization. To address this, we propose ESPFormer, the first architecture to integrate sliced optimal transport (SOT) with Expected Sliced Transport Plans (ESP), yielding a fully parallelizable, non-iterative, and end-to-end differentiable doubly-stochastic attention mechanism. We introduce temperature-based soft sorting to preserve differentiability while eliminating Sinkhorn iterations entirely. Evaluated on four diverse tasks (image classification, point cloud recognition, sentiment analysis, and neural machine translation), ESPFormer consistently outperforms strong baselines, improving generalization and robustness, and accelerates training by over 3.2× compared to Sinkhorn-based approaches without compromising accuracy or expressiveness.
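The summary names temperature-based soft sorting as the ingredient that keeps the mechanism differentiable. The paper's exact relaxation is not spelled out here, so the sketch below uses a SoftSort-style construction as an assumption: pairwise distances between the sorted and unsorted scores are pushed through a row-wise softmax, producing a relaxed permutation matrix that approaches the hard sorting permutation as the temperature goes to zero.

```python
import torch

def soft_sort(s: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Temperature-based soft sort (SoftSort-style relaxation; an assumption,
    not necessarily the paper's exact construction).

    s: 1-D tensor of n scores.
    Returns a relaxed permutation matrix P of shape (n, n) whose i-th row is a
    softmax peaked at the index of the i-th largest value of s. As tau -> 0,
    P approaches the hard permutation that sorts s in descending order.
    """
    s = s.reshape(-1, 1)                                      # (n, 1)
    s_sorted = torch.sort(s, dim=0, descending=True).values   # (n, 1)
    pairwise = (s_sorted - s.t()).abs()                       # (n, n) distances
    return torch.softmax(-pairwise / tau, dim=-1)             # row-stochastic
```

Because every step is a dense tensor operation, gradients flow through the (soft) sorting, which is what allows a sorting-based attention plan to be trained end to end.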
📝 Abstract
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications.
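To make the mechanism concrete, here is a minimal, hypothetical sketch of why a sliced-transport attention plan can be doubly stochastic without Sinkhorn iterations: each random 1-D projection (slice) induces a 1-D optimal transport plan between queries and keys that, for equal-sized token sets with uniform weights, is simply a sorting permutation; averaging these (soft) permutation matrices over slices yields a matrix whose rows and columns each sum to one. Everything below (function names, the number of slices, the temperature) is an illustrative assumption and not the authors' ESPFormer implementation; it reuses the `soft_sort` sketch above.

```python
import torch

def sliced_doubly_stochastic_attention(q: torch.Tensor,
                                        k: torch.Tensor,
                                        n_slices: int = 16,
                                        tau: float = 0.1) -> torch.Tensor:
    """Illustrative sketch (not the paper's code): average 1-D transport plans
    over random slicing directions to obtain an (approximately) doubly
    stochastic attention matrix with no Sinkhorn iterations.

    q, k: (n, d) query and key matrices with the same number of tokens n.
    """
    n, d = q.shape
    attn = q.new_zeros(n, n)
    for _ in range(n_slices):
        # Random unit direction defining the 1-D slice.
        theta = torch.randn(d, device=q.device, dtype=q.dtype)
        theta = theta / theta.norm()
        # Soft sorting permutations of the projected queries and keys.
        p_q = soft_sort(q @ theta, tau)   # (n, n), see sketch above
        p_k = soft_sort(k @ theta, tau)   # (n, n)
        # p_q.T @ p_k matches the i-th ranked query with the i-th ranked key:
        # the 1-D optimal transport plan for uniform weights. With hard sorting
        # it is a permutation matrix; the soft version is approximately so.
        attn = attn + p_q.t() @ p_k
    # An average of (soft) permutation matrices is (approximately) doubly
    # stochastic: rows and columns each sum to ~1.
    return attn / n_slices
```

In the hard-sorting limit (tau -> 0) each slice contributes an exact permutation matrix, so the average is exactly doubly stochastic by construction; the finite-temperature version trades a small deviation from double stochasticity for differentiability.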