🤖 AI Summary
Existing theoretical analyses of Transformer expressivity focus predominantly on formal language recognition (i.e., acceptance or rejection of strings), neglecting their primary real-world use as autoregressive probabilistic language models, a significant conceptual gap.
Method: We give the first systematic characterization of the class of probability distributions expressible by hard-attention Transformers operating as autoregressive language models, combining formal language theory, probabilistic modeling, and autoregressive generation frameworks to rigorously analyze their expressivity.
Contribution/Results: We show that the autoregressive and probabilistic structure fundamentally enhances expressivity, breaking the equivalence between hard-attention Transformers and finite automata that holds in the non-probabilistic setting. We precisely delineate the theoretical limits of such Transformers in modeling probability distributions over sequences and formally establish the expressive advantage conferred by the language modeling paradigm over mere recognition.
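For context, the standard definition of an autoregressive language model (not quoted from the paper itself) is a distribution over strings $w \in \Sigma^*$ obtained by factoring the probability of each string into next-symbol conditionals, with an end-of-string event making the distribution sum to one:

```latex
p(w) \;=\; p(\mathrm{EOS} \mid w) \prod_{t=1}^{|w|} p(w_t \mid w_{<t}),
\qquad \sum_{w \in \Sigma^*} p(w) = 1
```

A recognizer, by contrast, computes only a binary accept/reject decision per string; the paper's question is which distributions $p$ of the above form a hard-attention Transformer can realize, rather than which languages it can recognize.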
📝 Abstract
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.