Parity, Sensitivity, and Transformers

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the computational limits of single-layer Transformers by examining their ability to compute the PARITY function. The authors construct a single-layer, multi-head Transformer, using standard softmax attention, a length-independent and polynomially bounded positional encoding, and no LayerNorm, that exactly solves PARITY under both causal and non-causal masking; prior constructions required at least two layers or impractical features. They also prove that a single-layer, single-head Transformer cannot compute PARITY, the first lower bound of this kind, which pins down the limit of the minimal architecture. Together, these results provide a concrete constructive example and a matching impossibility result for understanding the representational capacity of Transformer architectures.
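For intuition about why the single-head case is hard, here is a minimal NumPy sketch (illustrative only, not the paper's construction; the helper names are assumptions): with all-zero query/key scores, softmax attention is uniform, so a single head can only expose the fraction of ones k/n at every position, and PARITY = k mod 2 must then somehow be recovered from that one scalar.

```python
import numpy as np

def parity(bits):
    """PARITY: 1 if the number of ones is odd, else 0."""
    return int(sum(bits) % 2)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def uniform_attention_head(x):
    """One softmax head with zero query/key scores: every attention
    row is uniform, so the head outputs the mean of the values.
    For 0/1 inputs this exposes k/n, the fraction of ones."""
    n = len(x)
    scores = np.zeros((n, n))           # q . k = 0 everywhere
    weights = softmax(scores, axis=-1)  # each row is uniform 1/n
    values = np.asarray(x, dtype=float).reshape(n, 1)
    return weights @ values             # every position sees k/n

bits = [1, 0, 1, 1, 0]
print(uniform_attention_head(bits)[0, 0])  # 0.6 == 3/5, fraction of ones
print(parity(bits))                        # 1 (three ones)
```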

📝 Abstract
The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY -- or more generally -- which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY -- by showing that it cannot be done with only one layer and one head.
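The causal-masking variant mentioned in the abstract restricts each position to attend only to itself and earlier positions. A minimal sketch of how such a mask is typically applied to attention scores (an illustrative standard recipe, not the paper's construction):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend only to j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Set disallowed positions to -inf before softmax, the usual
    way causal masking is imposed on attention weights."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) = 0, so masked entries get weight 0
    return e / e.sum(axis=-1, keepdims=True)

n = 4
weights = masked_softmax(np.zeros((n, n)), causal_mask(n))
print(weights)  # row i is uniform over the first i+1 positions
```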
Problem

Research questions and friction points this paper is trying to address.

PARITY
Transformers
Computational Complexity
Attention Mechanism
Lower Bound
Innovation

Methods, ideas, or system contributions that make the work stand out.

transformer
PARITY
lower bound
positional encoding
causal masking