Probability Distributions Computed by Hard-Attention Transformers

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing theoretical analyses of transformer expressivity focus predominantly on formal language recognition (i.e., acceptance or rejection of strings), neglecting transformers' primary real-world use as autoregressive probabilistic language models. Method: The paper characterizes the class of probability distributions expressible by hard-attention transformers operating as autoregressive language models, combining formal language theory with probabilistic modeling and autoregressive generation. Contribution/Results: It shows that autoregressive and probabilistic structure can increase expressivity, breaking equivalences between hard-attention transformers and other formalisms that hold in the non-probabilistic setting, and it delineates the limits of such transformers in modeling probability distributions over sequences, establishing the expressive advantage of the language-modeling paradigm over pure recognition.

📝 Abstract
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
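To make the recognizer-versus-language-model distinction concrete, here is a minimal illustrative sketch (not the paper's construction) of a single unique-hard-attention layer used as an autoregressive language model: at each step, the query from the final position attends to exactly one prefix position (the argmax), and the attended value is projected to a next-symbol distribution. All weights, the toy alphabet, and the function names are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["a", "b", "<eos>"]  # toy alphabet for illustration
D = 4                        # embedding dimension

E = rng.normal(size=(len(VOCAB), D))         # token embeddings
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
Wout = rng.normal(size=(D, len(VOCAB)))      # projection to vocabulary logits

def next_symbol_distribution(prefix_ids):
    """P(next symbol | prefix) under unique hard attention:
    the final position attends to a single prefix position (argmax of scores),
    rather than taking a softmax-weighted average as in soft attention."""
    X = E[prefix_ids]                        # (t, D) embedded prefix
    q = X[-1] @ Wq                           # query from the last position
    scores = (X @ Wk) @ q                    # one score per prefix position
    j = int(np.argmax(scores))               # hard attention: pick one position
    h = X[j] @ Wv                            # attended value, no averaging
    logits = h @ Wout
    p = np.exp(logits - logits.max())
    return p / p.sum()                       # a proper next-symbol distribution

p = next_symbol_distribution([0, 1, 0])      # conditional distribution after "a b a"
```

As a recognizer, the same network would only output accept/reject for a whole string; as a language model, it defines a distribution over strings by chaining these per-step conditionals, which is the object the paper characterizes.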
Problem

Research questions and friction points this paper is trying to address.

Characterize the probability distributions expressible by transformer language models
Analyze how autoregressive generation affects transformer expressivity
Investigate which equivalences from the non-probabilistic setting break for probabilistic transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that making transformer recognizers autoregressive can increase their expressivity
Shows that probabilistic transformers break equivalences that hold in the non-probabilistic case
Characterizes the probability distributions expressible by transformer language models