Lower bounds for one-layer transformers that compute parity

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the limits of single-layer Transformers in computing the parity function. Combining an analysis of the self-attention mechanism with rational function approximation, the authors prove a lower bound on the product of the number of attention heads and the degree of the post-processing rational function: for a self-attention layer followed by a rational function to sign-represent parity on input sequences of length $n$, this product must grow linearly with $n$. Through rational approximation of ReLU networks, the bound extends in a margin-dependent form to self-attention layers post-processed by ReLU networks.

📝 Abstract

This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.
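
To make the term "sign-represent" concrete, here is a minimal Python sketch. It is not taken from the note: the ±1 input encoding, the helper names (`parity`, `sign_represents`, `full_monomial`, `linear_guess`), and the brute-force check are illustrative assumptions. A real-valued function sign-represents parity if its output is nonzero and has the same sign as the parity on every input.

```python
from itertools import product
from math import prod


def parity(x):
    """Parity in an assumed +/-1 encoding: the product of the entries."""
    return prod(x)


def sign_represents(f, n):
    """Brute-force check that f(x) is nonzero and agrees in sign with
    parity(x) on every x in {-1, +1}^n (feasible only for small n)."""
    return all(f(x) * parity(x) > 0 for x in product((-1, 1), repeat=n))


def full_monomial(x):
    """Degree-n polynomial x_1 * ... * x_n; it equals parity exactly."""
    return prod(x)


def linear_guess(x):
    """A hypothetical degree-1 candidate, used only to show a failure."""
    return sum(x) + 0.5


if __name__ == "__main__":
    print(sign_represents(full_monomial, 4))
    print(sign_represents(linear_guess, 2))
```

The degree-$n$ monomial passes the check for any $n$, while the degree-1 candidate already fails at $n = 2$ (it is negative on the input $(-1, -1)$, whose parity is $+1$). The note's result concerns a different resource trade-off, heads times post-processing degree, but the notion of sign-representation being tested is the same.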

Problem

Research questions and friction points this paper is trying to address.

transformer

parity function

self-attention

lower bound

rational function

Innovation

Methods, ideas, or system contributions that make the work stand out.

transformer lower bounds

parity function

self-attention