Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport

📅 2025-08-11

🤖 AI Summary

This work addresses the lack of a theoretical foundation for scaled dot-product attention (SDPA), whose form is usually motivated by heuristics. It derives from first principles that SDPA's forward pass is the exact solution of a degenerate, one-sided entropy-regularized optimal transport (EOT) problem: finding the attention distribution that maximizes query-key similarity while remaining maximally entropic. It further proves that the gradient computed by standard backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning, so inference and learning follow from the same optimization principle. The EOT formulation induces an information geometry on the space of attention distributions, characterized by the Fisher information matrix, and this geometry determines the precise form of the learning gradient. The contributions are threefold: (i) a mathematical characterization of the SDPA forward pass as an EOT solution; (ii) an identification of the backward pass with an advantage-based policy gradient; and (iii) an information-geometric account of why the gradient takes that form. The work connects deep learning, optimal transport, information geometry, and reinforcement learning, and presents attention as optimal inference in the forward pass paired with a manifold-aware learning update in the backward pass.

📝 Abstract

The scaled-dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism where the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.
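
To make the forward-pass claim concrete, here is a minimal numerical sketch in plain Python. It assumes the one-sided problem for a single query can be written as maximizing expected similarity plus temperature times Shannon entropy over the probability simplex, with temperature sqrt(d_k); this is the standard Gibbs variational form and may differ in notation from the paper's own formulation. All names (`objective`, `random_distribution`, `tau`) are illustrative and not taken from the paper.

```python
import math
import random

def softmax(scores, tau):
    """Numerically stable softmax of scores / tau."""
    m = max(s / tau for s in scores)
    exps = [math.exp(s / tau - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def objective(p, scores, tau):
    """Expected similarity under p plus tau times the Shannon entropy of p."""
    similarity = sum(pj * sj for pj, sj in zip(p, scores))
    entropy = -sum(pj * math.log(pj) for pj in p if pj > 0.0)
    return similarity + tau * entropy

def random_distribution(n, rng):
    """A random point on the probability simplex (for comparison only)."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
d_k = 4
query = [rng.gauss(0, 1) for _ in range(d_k)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(6)]

# Unscaled dot-product similarities between the query and each key.
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
tau = math.sqrt(d_k)  # assumed temperature: the usual SDPA scaling factor

# Attention weights for this query, as SDPA would compute them.
attn = softmax(scores, tau)
best = objective(attn, scores, tau)

# Closed-form optimum of the objective: tau * logsumexp(scores / tau).
m = max(scores)
log_partition = m + tau * math.log(sum(math.exp((s - m) / tau) for s in scores))

# No randomly sampled distribution should score higher than the softmax weights.
gap = max(
    objective(random_distribution(len(scores), rng), scores, tau)
    for _ in range(1000)
) - best

print(best, log_partition, gap)
```

In this sketch the softmax weights attain the closed-form optimum, and every sampled alternative scores lower, which is the sense in which the forward pass "solves" the regularized problem under the stated assumptions.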

Problem

Research questions and friction points this paper is trying to address.

Justifies SDPA via entropic optimal transport

Links backpropagation to advantage-based policy gradients

Reveals SDPA's information geometry and learning dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

SDPA solves one-sided Entropic Optimal Transport

Backpropagation equals advantage-based policy gradient

EOT induces Fisher geometry for learning gradients
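
The last two points can be checked numerically with a small sketch in plain Python. It assumes a single query, scores that are already scaled, and an arbitrary squared-error loss on the attention output; the names `signal`, `baseline`, and `fisher` are illustrative rather than the paper's notation. The sketch relies on one well-known fact: the Fisher information matrix of a categorical distribution in logit coordinates is diag(p) - p pᵀ, which is also the softmax Jacobian. Because the example minimizes a loss, `signal` plays the role of a per-key cost rather than a reward.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def forward(scores, values, target):
    """Attention output for one query and a squared-error loss on it."""
    p = softmax(scores)
    out = [sum(pj * v[i] for pj, v in zip(p, values)) for i in range(len(target))]
    loss = 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))
    return loss, p, out

rng = random.Random(1)
n, d_v = 5, 3
scores = [rng.gauss(0, 1) for _ in range(n)]  # already-scaled scores (assumption)
values = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(n)]
target = [rng.gauss(0, 1) for _ in range(d_v)]

_, p, out = forward(scores, values, target)
upstream = [o - t for o, t in zip(out, target)]  # dL/d(out)

# Per-key signal: how much the loss changes if weight moves onto key j.
signal = [sum(g * vi for g, vi in zip(upstream, v)) for v in values]
# Baseline: the expected signal under the current attention weights.
baseline = sum(pj * rj for pj, rj in zip(p, signal))
# Advantage form of the score gradient: p_j * (signal_j - baseline).
advantage_grad = [pj * (rj - baseline) for pj, rj in zip(p, signal)]

# Same gradient written as (Fisher matrix) @ signal, with F = diag(p) - p p^T.
fisher = [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)] for i in range(n)]
fisher_grad = [sum(fisher[i][j] * signal[j] for j in range(n)) for i in range(n)]

# Central finite differences as an independent check on both forms.
eps = 1e-6
numeric_grad = []
for j in range(n):
    up = list(scores)
    up[j] += eps
    down = list(scores)
    down[j] -= eps
    numeric_grad.append(
        (forward(up, values, target)[0] - forward(down, values, target)[0]) / (2 * eps)
    )

print(advantage_grad)
print(numeric_grad)
```

Subtracting the baseline leaves the gradient components summing to zero, which reflects the fact that adding a constant to every score does not change the attention weights.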

🔎 Similar Papers

Cost-Effective Attention Mechanisms for Low Resource Settings: Necessity & Sufficiency of Linear Transformations