🤖 AI Summary
This study investigates the computational limits of single-layer Transformers by examining their ability to compute the PARITY function. The authors construct a single-layer multi-head Transformer, using standard softmax attention, a length-independent and polynomially bounded positional encoding, and no LayerNorm, that exactly solves PARITY under both causal and non-causal masking. This is the first construction that works under these architectural constraints. They also prove that a single-layer, single-head Transformer cannot compute PARITY, the first lower bound for transformers solving this problem, which delimits the expressive power of that restricted architecture. Together, these results give both a concrete constructive example and a matching impossibility result for understanding the representational capacity of Transformer architectures.
📝 Abstract
The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY -- and more generally, which kinds of transformers can? Known constructions for PARITY require at least 2 layers and rely on impractical features: a length-dependent positional encoding, hardmax, layernorm without the regularization parameter, or incompatibility with causal masking. We give a new construction of a transformer for PARITY with softmax, a length-independent and polynomially bounded positional encoding, and no layernorm, that works both with and without causal masking. We also give the first lower bound for transformers solving PARITY, by showing that it cannot be done with only one layer and one head.
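For readers unfamiliar with the problem: PARITY maps a bit string to 1 exactly when it contains an odd number of 1s. A minimal reference implementation (not from the paper, just a specification of the target function a transformer must realize) is:

```python
from functools import reduce
from itertools import product

def parity(bits):
    """Return 1 if the sequence contains an odd number of 1s, else 0."""
    return sum(bits) % 2

# Brute-force sanity check against an explicit XOR fold
# over all bit strings of length 1 through 4.
for n in range(1, 5):
    for x in product([0, 1], repeat=n):
        assert parity(x) == reduce(lambda a, b: a ^ b, x, 0)
```

Despite this one-line definition, PARITY is a classic hard case for bounded-depth models, which is why it serves as the benchmark function in the constructions and lower bound above.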