Lower bounds for one-layer transformers that compute parity

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work studies the theoretical limits of single-layer Transformers in computing the parity function. Combining sign-representation arguments, rational approximation, and an analysis of the self-attention mechanism, the authors prove a linear lower bound: for a self-attention layer followed by a rational post-processing function to sign-represent parity over input sequences of length \( n \), the product of the number of attention heads and the degree of the post-processing function must grow linearly with \( n \). Using rational approximation of ReLU networks, the bound extends in a margin-dependent form to self-attention layers post-processed by ReLU networks.
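The lower bound can be written compactly. The notation below is assumed for illustration and is not taken from the paper: inputs are encoded in \( \{-1,+1\}^n \), \( H \) is the number of attention heads, \( d \) is the degree of the rational post-processing function, and \( F \) is the real-valued output of the full model. The paper's exact constants and its definition of degree may differ.

```latex
% Assumed notation: x in {-1,+1}^n, H attention heads,
% rational post-processing function of degree d, model output F(x).
\[
  \mathrm{PARITY}_n(x) = \prod_{i=1}^{n} x_i ,
  \qquad
  F \text{ sign-represents } \mathrm{PARITY}_n
  \iff
  F(x)\,\mathrm{PARITY}_n(x) > 0
  \quad \text{for all } x \in \{-1,+1\}^n .
\]
\[
  F \text{ sign-represents } \mathrm{PARITY}_n
  \;\Longrightarrow\;
  H \cdot d = \Omega(n).
\]
```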
๐Ÿ“ Abstract
This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.
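To make "sign-represent" concrete, here is a minimal brute-force sketch. It is not the paper's construction: it assumes the \( \pm 1 \) input encoding, and the helper names (`parity`, `sign_represents_parity`, `full_monomial`, `partial_monomial`) are hypothetical. It only checks the definition on small \( n \), comparing a degree-\( n \) monomial with a degree-\( (n-1) \) one.

```python
from itertools import product
from math import prod


def parity(x):
    """Parity in the +/-1 encoding (an assumption): the product of the entries."""
    return prod(x)


def sign_represents_parity(f, n):
    """Return True if f(x) is nonzero and agrees in sign with parity(x)
    for every x in {-1, +1}^n (checked by exhaustive enumeration)."""
    return all(f(x) * parity(x) > 0 for x in product((-1, 1), repeat=n))


def full_monomial(x):
    """Degree-n monomial x_1 * ... * x_n; equals parity itself."""
    return prod(x)


def partial_monomial(x):
    """Degree-(n-1) monomial that ignores the last coordinate."""
    return prod(x[:-1])


n = 4
print(sign_represents_parity(full_monomial, n))     # True
print(sign_represents_parity(partial_monomial, n))  # False
```

The second check fails because flipping the ignored coordinate flips parity but leaves the candidate unchanged. A function that does not depend on every coordinate therefore cannot sign-represent parity. The exhaustive loop visits \( 2^n \) inputs, so it is only practical for small \( n \).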
Problem

Research questions and friction points this paper is trying to address.

transformer
parity function
self-attention
lower bound
rational function
Innovation

Methods, ideas, or system contributions that make the work stand out.

transformer lower bounds
parity function
self-attention
rational approximation
sign-representation