The Origin of Self-Attention: From Pairwise Affinity Matrices to Transformers

📅 2025-07-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper examines the theoretical origins of self-attention, arguing that it is one instance of a broader “affinity computation” paradigm in which a pairwise affinity matrix A modulates information flow. Method: The authors survey how affinity matrices are used across computer vision, NLP, and graph learning, and model Transformer attention as a special case of Infinite Feature Selection (Inf-FS) restricted to single-hop propagation; the multi-hop propagation used by Inf-FS supplies the perspective that unifies the two. The analysis combines affinity matrix modeling, dynamic token similarity computation, and graph-structured reasoning. Contribution/Results: First, it draws a theoretical connection between self-attention and classical feature selection. Second, it proposes a cross-domain framework that exposes the mathematical foundation shared by diverse models. Third, it frames affinity-based computation as a basis for designing interpretable and scalable attention mechanisms.

📝 Abstract

The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of Inf-FS: it uses a single-hop affinity computation where A is dynamically built from token similarities. We argue that the underlying structure, reasoning over pairwise relationships, is preserved across both approaches, and the key differences lie in how the affinity matrix is defined and applied. By situating self-attention within the broader paradigm of affinity-based computation, we unify several strands of machine learning research and highlight a common mathematical foundation that underpins diverse models and tasks.
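The two computations contrasted in the abstract can be written side by side. The notation here is an assumption taken from the standard formulations of scaled dot-product attention and of Inf-FS; the abstract itself names only the affinity matrix A. In particular, X (input tokens), W_Q, W_K, W_V (projections), d_k (key dimension), r (damping factor), ρ(A) (spectral radius), and s (relevance scores) are introduced for illustration.

```latex
% Single-hop (self-attention): A is built from token similarities and applied once.
Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \qquad
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \qquad
Y = A V

% Multi-hop (Inf-FS): paths of every length l over the affinity graph are summed.
% The geometric series converges when 0 < r < 1 / \rho(A).
S = \sum_{l=1}^{\infty} (rA)^{l} = (I - rA)^{-1} - I, \qquad
s = S \, \mathbf{1}

% Truncating the series at l = 1 leaves a single application of A:
S \approx rA
```

Read this way, the structural difference is the number of hops and the way A is obtained: attention recomputes A from the input on every forward pass and applies it once, while Inf-FS takes A as given (or learned) and accumulates its powers.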

Problem

Research questions and friction points this paper is trying to address.

Traces the origins of self-attention to affinity matrices used across domains

Compares the Transformer's single-hop dot-product affinity with the multi-hop affinity propagation of Inf-FS

Unifies self-attention and affinity-based computation under shared principles

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns pairwise affinity matrices that control information flow

Generalizes affinity-based weighting via Infinite Feature Selection

Uses multi-hop propagation over affinity graphs
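The three points above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the absolute-correlation affinity in `feature_affinity`, and the damping choice `alpha / spectral_radius` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax_rows(x):
    """Numerically stable softmax applied to each row."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single hop: build A from token similarities, then apply it once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V, A


def feature_affinity(X):
    """Hypothetical feature-feature affinity: absolute Pearson correlation.

    This is an assumption for the demo; any symmetric non-negative
    affinity could be substituted.
    """
    A = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(A, 0.0)
    return A


def multi_hop_scores(A, alpha=0.9):
    """Multi-hop: sum (rA)^l over all l >= 1 using the closed form."""
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius
    r = alpha / rho  # keeps r < 1 / rho so the series converges
    S = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    return S.sum(axis=1), r


def truncated_scores(A, r, hops):
    """Same sum, explicitly truncated after a fixed number of hops."""
    n = A.shape[0]
    S = np.zeros((n, n))
    P = np.eye(n)
    for _ in range(hops):
        P = P @ (r * A)
        S += P
    return S.sum(axis=1)


rng = np.random.default_rng(0)

# Single-hop attention over 5 tokens with 8 input features.
tokens = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
attended, A_tokens = self_attention(tokens, Wq, Wk, Wv)

# Multi-hop relevance over 6 features estimated from 50 samples.
A_feat = feature_affinity(rng.normal(size=(50, 6)))
scores, r = multi_hop_scores(A_feat)
one_hop = truncated_scores(A_feat, r, hops=1)
```

With `hops=1`, `truncated_scores` reduces to a single application of the affinity matrix, which is the sense in which one-hop propagation is the limiting case of the multi-hop sum. The attention matrix is not reused for the multi-hop demo because its rows sum to one, which would make every row-sum score identical.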

🔎 Similar Papers

Dissecting Query-Key Interaction in Vision Transformers