AI Summary
This work investigates the theoretical limits of single-layer Transformers in computing the parity function. By combining sign-representation arguments, rational function approximation, and a detailed analysis of the self-attention mechanism, the authors establish, for the first time, a linear lower bound on the product of the number of attention heads and the complexity of the post-processing function (a rational function, measured by its degree, or a ReLU network): to sign-represent the parity function over input sequences of length \( n \), this product must grow linearly with \( n \). The result reveals a fundamental limitation of single-layer Transformers in expressing such functions, and it extends, in a margin-dependent form, to self-attention layers post-processed by ReLU networks, thereby quantifying how model capacity must scale with input length.
Abstract
This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.
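To make "sign-represent" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the note's construction or proof. A function f sign-represents parity on {0,1}^n if f(x) is nonzero and has the sign of parity(x) for every input x. The names `parity`, `sign_represents`, and `mean_then_polynomial` are hypothetical, and the averaging step is only an assumed stand-in for a single attention head with constant attention weights.

```python
from itertools import product


def parity(bits):
    """Return +1 for an even number of ones and -1 for an odd number."""
    return -1 if sum(bits) % 2 else 1


def sign_represents(f, n):
    """Brute-force check that f(x) is nonzero and agrees in sign with parity(x) on {0,1}^n."""
    return all(f(x) * parity(x) > 0 for x in product((0, 1), repeat=n))


def mean_then_polynomial(n):
    """Assumed toy model: average the n bits, then apply a degree-n polynomial.

    The polynomial has one root between each pair of consecutive integers
    0, 1, ..., n, so its sign flips every time the bit count increases by one.
    """
    def f(bits):
        mean = sum(bits) / n      # stand-in for one uniform-attention head
        count = n * mean          # recover the number of ones from the mean
        value = (-1) ** n
        for k in range(1, n + 1):
            value *= count - (k - 0.5)
        return value
    return f


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, sign_represents(mean_then_polynomial(n), n))
    # A degree-1 post-processing function fails already at n = 2 (XOR).
    print(sign_represents(lambda bits: 0.5 - sum(bits), 2))
```

In this toy setting one averaging head is paired with a polynomial of degree n, which gives a feel for why the lower bound is stated on the product of the number of heads and the degree rather than on either quantity alone.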