ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection

📅 2026-05-12

🤖 AI Summary
This work addresses two costs of doubly stochastic attention: Sinkhorn-based layers repeat matrix scaling in every inference forward pass, and existing non-iterative (sliced-transport) approaches require substantially more training resources. The authors propose a “train-then-compile” strategy: during training, Sinkhorn scaling is used to learn the doubly stochastic attention layer; at inference, the iterative loop is replaced by a fixed sliced-dual operator. A lightweight parametric map takes exact one-dimensional Kantorovich potentials to the Sinkhorn query-side dual, and the attention plan is then reconstructed with a two-sided entropic c-transform. The approach thus combines Sinkhorn-based training with iteration-free inference. On language and vision benchmarks it remains highly competitive with recent baselines; in the main frozen-layer benchmark it is 5.3× faster than the trained Sinkhorn teacher while matching its accuracy, and in downstream replacements it recovers most of the teacher's performance without retraining.
📝 Abstract
Doubly-stochastic attention has emerged as a transport-based alternative to row-softmax attention, with recent Transformer variants using it to reduce attention sinks and rank collapse while improving performance. In this family, the standard approach is Sinkhorn scaling, which trains more efficiently but still repeats matrix scaling in every inference forward pass. Sliced-transport attention removes the online iteration, but its soft sorting approximation materializes dense tensors for each slice, requiring substantially more training resources than Sinkhorn attention. We introduce ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection, a train-then-compile method that trains the doubly-stochastic layer with Sinkhorn, then replaces the iterative scaling loop at inference with a fixed sliced-dual operator. It learns a lightweight parametric map from exact one-dimensional Kantorovich potentials to the Sinkhorn query-side dual, then reconstructs the attention plan with a two-sided entropic c-transform. Across language and vision benchmarks, ASAP keeps the cheaper training setup and remains highly competitive with recent baselines. In the main frozen-layer benchmark, ASAP is 5.3× faster than the trained Sinkhorn teacher while matching its accuracy; in downstream replacements, ASAP recovers most of the teacher performance without any retraining.
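To make the contrast between iterative Sinkhorn scaling and a one-shot reconstruction from a query-side dual concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes a square score matrix, uniform marginals, and unit entropic regularization, and it reads "two-sided entropic c-transform" as one key-side softmin update followed by one query-side update, which is an assumption of this sketch. All function names are hypothetical, and the paper's learned map from one-dimensional Kantorovich potentials to the dual is not modeled; `f_hat` simply stands in for its output.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def row_dual(scores, g):
    """Query-side dual f such that every row of exp(S + f + g) sums to 1."""
    return [-logsumexp([s + gj for s, gj in zip(row, g)]) for row in scores]


def col_dual(scores, f):
    """Key-side dual g such that every column of exp(S + f + g) sums to 1."""
    n_rows, n_cols = len(scores), len(scores[0])
    return [-logsumexp([scores[i][j] + f[i] for i in range(n_rows)])
            for j in range(n_cols)]


def plan(scores, f, g):
    """Attention plan P_ij = exp(S_ij + f_i + g_j)."""
    return [[math.exp(s + fi + gj) for s, gj in zip(row, g)]
            for row, fi in zip(scores, f)]


def sinkhorn_attention(scores, n_iters=200):
    """Iterative baseline: alternate column and row scaling in the log domain."""
    f = [0.0] * len(scores)
    g = [0.0] * len(scores[0])
    for _ in range(n_iters):
        g = col_dual(scores, f)
        f = row_dual(scores, g)
    return plan(scores, f, g), f


def plan_from_query_dual(scores, f_hat):
    """Iteration-free path (assumed form): given a predicted query-side dual,
    apply one key-side and one query-side softmin update, then build the plan."""
    g = col_dual(scores, f_hat)
    f = row_dual(scores, g)
    return plan(scores, f, g)
```

If `f_hat` equals the converged Sinkhorn dual, `plan_from_query_dual` returns the same plan as `sinkhorn_attention` up to numerical tolerance. With an imperfect `f_hat`, the rows still sum to one exactly, because the query-side update runs last, while the columns are only approximately normalized.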
Problem

Research questions and friction points this paper is trying to address.

doubly-stochastic attention
Sinkhorn scaling
inference efficiency
training cost
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

doubly-stochastic attention
Sinkhorn scaling
sliced optimal transport
amortized inference
entropic c-transform