Attention (as Discrete-Time Markov) Chains

📅 2025-07-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of a unified interpretative and modeling framework for attention mechanisms in Vision Transformers. We propose a reformulation grounded in discrete-time Markov chains: the attention matrix is interpreted as a state transition matrix, enabling unified modeling of fundamental operations—including token selection, aggregation, and averaging. We introduce *indirect attention*, which captures long-range semantic dependencies via multi-step propagation over the attention graph. Crucially, we observe that semantically similar tokens form metastable structures within the attention flow, leading to *TokenRank*—a differentiable, global token importance metric derived from spectral analysis of the transition matrix. The method relies solely on efficient matrix multiplication and eigendecomposition. Experiments demonstrate state-of-the-art performance on zero-shot image segmentation and significant improvements in unconditional image generation quality.
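The reinterpretation described above can be sketched in a few lines: row-normalize raw attention scores into a stochastic transition matrix, then take matrix powers to propagate attention over multiple steps. This is an illustrative sketch, not the authors' implementation; the function names and the plain softmax normalization are assumptions.

```python
import numpy as np

def attention_as_markov(scores):
    """Turn raw attention scores (n_tokens x n_tokens) into a
    row-stochastic transition matrix via a softmax over each row."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

def indirect_attention(P, k):
    """k-step transition probabilities: P itself is direct (one-step)
    attention; P^k routes attention through intermediate tokens."""
    return np.linalg.matrix_power(P, k)

rng = np.random.default_rng(0)
P = attention_as_markov(rng.normal(size=(6, 6)))  # toy 6-token "attention matrix"
P3 = indirect_attention(P, 3)                     # 3-step indirect attention
```

Because each row of `P` is a probability distribution, every matrix power `P^k` remains row-stochastic, so indirect attention stays interpretable as a distribution over tokens.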

📝 Abstract
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank -- the steady state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
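As a rough illustration of the TokenRank idea, the steady-state vector of the Markov chain (the left eigenvector of the transition matrix with eigenvalue 1) can be found by power iteration, much like PageRank. The name `token_rank` and the use of power iteration rather than explicit eigendecomposition are assumptions for this sketch.

```python
import numpy as np

def token_rank(P, n_iter=1000, tol=1e-12):
    """Steady-state distribution pi satisfying pi = pi @ P, found by
    power iteration; interpreted here as global token importance."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(n_iter):
        nxt = pi @ P
        nxt /= nxt.sum()              # renormalize against round-off drift
        if np.abs(nxt - pi).max() < tol:
            break
        pi = nxt
    return pi

# Toy 2-token chain: the steady state can be checked by hand.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = token_rank(P)                    # approx [0.6, 0.4]
```

For this 2-state chain, solving pi = pi @ P by hand gives pi = [0.6, 0.4], which the iteration recovers; for strictly positive attention-derived transition matrices the chain is ergodic, so the fixed point is unique.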
Problem

Research questions and friction points this paper is trying to address.

Attention operations (selection, summation, averaging) lack a unified interpretive framework
Prior analyses model only immediate, one-step attention effects
No principled global measure of token importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interprets attention matrix as Markov chain
Computes metastable states via matrix multiplication and eigenanalysis
Introduces TokenRank for token importance
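The metastable-state idea above can be illustrated spectrally: eigenvalues of the transition matrix close to 1 correspond to slowly-mixing token clusters where attention concentrates. The function name, the eigenvalue threshold, and the toy two-cluster chain below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def metastable_count(P, gap=0.2):
    """Count eigenvalues of P with magnitude above 1 - gap; each such
    eigenvalue indicates a slowly-mixing (metastable) cluster."""
    eigvals = np.linalg.eigvals(P)
    return int(np.sum(np.abs(eigvals) > 1.0 - gap))

# Toy chain: two groups of 3 tokens, with attention mostly within each group.
U = np.full((3, 3), 1.0 / 3.0)           # uniform within-group attention
P = np.block([[0.95 * U, 0.05 * U],
              [0.05 * U, 0.95 * U]])     # 5% leakage between groups
n_clusters = metastable_count(P)          # -> 2
```

Here the spectrum is {1, 0.9, 0, 0, 0, 0}: the eigenvalue at 0.9 reflects the slow mixing between the two groups, so the count correctly recovers two metastable clusters.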