🤖 AI Summary
Existing theoretical analyses of Transformer expressivity focus predominantly on formal language recognition (i.e., acceptance or rejection of strings), neglecting their primary real-world use as autoregressive probabilistic language models, a significant conceptual gap.
Method: We give the first systematic characterization of the class of probability distributions expressible by hard-attention Transformers operating as autoregressive language models, combining formal language theory, probabilistic modeling, and autoregressive generation frameworks to rigorously analyze their expressivity.
Contribution/Results: We show that the autoregressive and probabilistic structure fundamentally enhances expressivity, breaking the equivalence between hard-attention Transformers and finite automata that holds in the non-probabilistic setting. We precisely delineate the theoretical limits of such Transformers in modeling probability distributions over sequences and formally establish the expressive advantage conferred by the language modeling paradigm over mere recognition.
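For context, the standard definition of an autoregressive language model (not quoted from the paper itself) is a distribution over strings $w \in \Sigma^*$ obtained by factoring the probability of each string into next-symbol conditionals, with an end-of-string event making the distribution sum to one:

```latex
p(w) \;=\; p(\mathrm{EOS} \mid w) \prod_{t=1}^{|w|} p(w_t \mid w_{<t}),
\qquad \sum_{w \in \Sigma^*} p(w) = 1
```

A recognizer, by contrast, computes only a binary accept/reject decision per string; the paper's question is which distributions $p$ of the above form a hard-attention Transformer can realize, rather than which languages it can recognize.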
📝 Abstract
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.