Characterizing the Expressivity of Local Attention in Transformers

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
Local attention has been observed to improve Transformer quality, but its expressive advantages have lacked theoretical justification. This work addresses the gap through formal language theory and linear temporal logic: it proves that local attention introduces a second temporal operator, complementing the single past operator associated with global attention, and that neither mechanism subsumes the other. Combining the two strictly enlarges the class of regular languages the model can recognize. Experiments on formal language recognition and natural language modeling corroborate the theory, with hybrid global-local Transformers outperforming global-only baselines.
📝 Abstract
The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global-local transformers outperform their global-only counterparts.
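The difference between the two attention patterns described in the abstract can be illustrated with boolean masks. The following is a minimal sketch, not the paper's implementation: the function names, the `window` parameter, and the convention that a token's window includes the token itself are all assumptions made for illustration.

```python
def causal_mask(n):
    """Global causal attention: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, window):
    """Local attention (assumed convention): position i may attend only to the
    `window` most recent positions, counting itself, i.e. i - window < j <= i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits; a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, window = 6, 2
    for name, mask in (("global", causal_mask(n)), ("local", local_mask(n, window))):
        print(name, count_allowed(mask))
        for row in mask:
            print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of permitted pairs grows linearly in the sequence length, whereas the global causal mask permits n(n+1)/2 pairs, which is the quadratic-to-linear reduction the abstract refers to.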
Problem

Research questions and friction points this paper is trying to address.

local attention
transformer
expressivity
temporal logic
language modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

local attention
expressivity
temporal logic
transformer
formal languages