🤖 AI Summary
This paper examines the theoretical origins of self-attention, arguing that it modulates information flow through pairwise affinity matrices (A) and therefore belongs to a broader "affinity computation" paradigm. Method: The authors survey how affinity matrices are used across computer vision, NLP, and graph learning, and model Transformer attention as a special case of Infinite Feature Selection (Inf-FS) restricted to single-hop propagation. A multi-hop information propagation perspective unifies the two frameworks, combining affinity matrix modeling, dynamic token similarity computation, and graph-structured reasoning. Contribution/Results: First, the paper establishes a theoretical connection between self-attention and classical feature selection. Second, it proposes a cross-domain unified framework that exposes the mathematical foundations shared by diverse models. Third, it proposes a design paradigm for interpretable and scalable attention mechanisms grounded in affinity-based computation.
📝 Abstract
The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of Inf-FS: it uses a single-hop affinity computation where A is dynamically built from token similarities. We argue that the underlying structure, reasoning over pairwise relationships, is preserved across both approaches, and the key differences lie in how the affinity matrix is defined and applied. By situating self-attention within the broader paradigm of affinity-based computation, we unify several strands of machine learning research and highlight a common mathematical foundation that underpins diverse models and tasks.
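The contrast between single-hop and multi-hop use of an affinity matrix can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, queries, keys, and values are all set to the raw inputs (identity projections), the feature affinity matrix `A_feat` is hand-picked, and the multi-hop score is a truncated power series with an assumed damping factor `r`. In the limit of infinitely many hops that series is geometric and converges only when `r` is smaller than the reciprocal of the spectral radius of A.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_affinity(Q, K):
    """Affinity built dynamically from token similarities: softmax(Q K^T / sqrt(d))."""
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def self_attention(Q, K, V):
    """Single-hop propagation: each output row is an affinity-weighted mix of V."""
    return matmul(attention_affinity(Q, K), V)


def multi_hop_scores(A, r, hops):
    """Relevance of each node as the row sums of sum_{l=1..hops} (r A)^l.

    Assumption: a truncated version of multi-hop propagation over the
    affinity graph; hops=1 uses only direct pairwise affinities.
    """
    rA = [[r * a for a in row] for row in A]
    power = rA
    total = [row[:] for row in rA]
    for _ in range(hops - 1):
        power = matmul(power, rA)  # paths that are one hop longer
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return [sum(row) for row in total]


# Three tokens with two features each; Q = K = V = X (assumed identity projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A_tokens = attention_affinity(X, X)
attended = self_attention(X, X, X)

# Hand-picked symmetric affinity between three features (illustrative values only).
A_feat = [[0.0, 0.9, 0.1],
          [0.9, 0.0, 0.2],
          [0.1, 0.2, 0.0]]
one_hop = multi_hop_scores(A_feat, r=0.5, hops=1)
ten_hops = multi_hop_scores(A_feat, r=0.5, hops=10)
```

Here `A_tokens` is recomputed from the inputs on every call and applied once, whereas `A_feat` is fixed in advance and applied repeatedly, which mirrors the distinction the abstract draws between how the affinity matrix is defined and how it is applied.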