🤖 AI Summary
This paper addresses how to increase the expressive power of Transformer models without adding parameters. Method: We propose a parallelizable inference-time mechanism that appends padding tokens and dynamically increases depth via looping (e.g., O(logᵈ n)-depth loops), analyzed for averaging-hard-attention, masked-pre-norm Transformers with polynomial padding through the lens of circuit complexity. Contribution/Results: We establish an exact correspondence between padded Transformers and the circuit complexity classes TC⁰ and TCᵈ, and we systematically bring complete problems and reductions, cornerstones of classical complexity theory, into the theoretical analysis of Transformers. We prove that padded Transformers recognize exactly TC⁰, that adding O(logᵈ n) looping yields exactly TCᵈ, and that with polylogarithmic looping padded Transformers converge to NC, the best expressivity achievable while preserving parallelism (unless NC = P), thereby mapping the fundamental computational limits of these models.
📝 Abstract
Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding converge to precisely the class $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, padded transformers converge to the class $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought.
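To make the two inference-time knobs concrete, here is a minimal, hedged sketch, not taken from the paper: the helper names (`poly_pad`, `looped_forward`), the hyperparameters, and the use of a standard PyTorch pre-norm encoder layer with ordinary soft attention are our own illustrative assumptions, whereas the paper's formal model uses averaging hard attention and masked pre-norm. The sketch appends polynomially many padding tokens to the input and then reuses a single block for roughly $O(\log^d n)$ passes, adding depth without adding parameters.

```python
import math
import torch
import torch.nn as nn

D_MODEL, N_HEADS, VOCAB, PAD_ID = 64, 4, 100, 0

embed = nn.Embedding(VOCAB, D_MODEL)
# Toy pre-norm encoder block; the paper's constructions instead assume
# averaging hard attention and masked pre-norm (assumption of this sketch).
block = nn.TransformerEncoderLayer(
    d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=128,
    norm_first=True, batch_first=True)

def poly_pad(tokens: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Append n^k blank (padding) tokens to a length-n input: polynomial padding."""
    n = tokens.shape[1]
    pad = torch.full((tokens.shape[0], n ** k), PAD_ID, dtype=tokens.dtype)
    return torch.cat([tokens, pad], dim=1)

def looped_forward(tokens: torch.Tensor, d: int = 1) -> torch.Tensor:
    """Reuse one block for ceil(log2(n)^d) passes: dynamic depth via looping.
    (The log of the padded length is within a constant factor of the log of
    the original input length, so either can be used for illustration.)"""
    n = tokens.shape[1]
    depth = max(1, math.ceil(math.log2(n) ** d))
    x = embed(tokens)
    for _ in range(depth):   # same weights every pass: no new parameters
        x = block(x)
    return x

x = torch.randint(1, VOCAB, (1, 8))           # batch of 1, input length n = 8
out = looped_forward(poly_pad(x, k=2), d=2)   # padded to 8 + 8**2 = 72 positions
print(out.shape)                              # torch.Size([1, 72, 64])
```

Because every looped pass reuses the same weights and processes all positions in parallel, extra compute here adds depth rather than parameters or sequential token-by-token decoding, which is the contrast with chain of thought drawn in the abstract.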