Exact Expressive Power of Transformers with Padding

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper asks how to enhance the expressive power of Transformer models without increasing their parameter count. Method: a parallelizable inference-time mechanism that combines padding tokens with dynamic-depth looping (e.g., O(logᵈ n)-depth loops), built on average hard attention, masked pre-normalization, polynomial padding, and circuit-complexity modeling. Contribution/Results: the paper establishes, for the first time, a rigorous equivalence between padding-augmented Transformers and the circuit complexity classes TC⁰ and TCᵈ, and systematically brings complete problems and reduction techniques into the theoretical analysis of Transformers. It proves that padded Transformers recognize exactly TC⁰, and that adding O(logᵈ n)-depth looping yields exactly TCᵈ; with polylogarithmic looping, padded Transformers converge to NC, the best expressivity achievable while preserving parallelism, thereby revealing fundamental computational complexity boundaries.
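A minimal sketch of the two inference-time mechanisms the summary describes: appending blank padding tokens (extra parallel compute slots) and re-applying a weight-shared block in a loop, with averaging-hard attention. This is an illustrative assumption, not the paper's formal construction; all names and the residual wiring are hypothetical.

```python
import numpy as np

def average_hard_attention(X):
    """Averaging-hard attention: each position attends uniformly to the
    positions achieving the maximum (here: dot-product) score, and averages
    their values. X has shape (sequence_length, model_dim)."""
    scores = X @ X.T
    out = np.zeros_like(X)
    for i in range(len(X)):
        mask = np.isclose(scores[i], scores[i].max())
        out[i] = X[mask].mean(axis=0)
    return out

def padded_looped_forward(x, pad_len, depth):
    """Append `pad_len` blank padding-token embeddings, then apply the same
    (weight-shared) block `depth` times -- the looping mechanism."""
    d = x.shape[1]
    h = np.vstack([x, np.zeros((pad_len, d))])
    for _ in range(depth):
        h = h + average_hard_attention(h)  # residual + attention, shared across loops
    return h

# Polynomial padding and polylog looping for input length n (illustrative):
n = 4
x = np.eye(n)                       # n toy token embeddings
pad_len = n ** 2                    # polynomial padding
depth = int(np.ceil(np.log2(n)))    # ~O(log n) looped applications
h = padded_looped_forward(x, pad_len, depth)
```

The key point the summary makes is that both knobs (`pad_len` and `depth`) add compute at inference time without adding any parameters, and both remain parallelizable across positions, unlike sequential chain-of-thought decoding.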

📝 Abstract
Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding converge to precisely the class $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, padded transformers converge to the class $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought.
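For orientation, the complexity classes named in the abstract nest as follows (standard facts from circuit complexity, not results of this paper), where $\mathsf{TC}^d$ is the class decided by polynomial-size threshold circuits of depth $O(\log^d n)$:

```latex
\mathsf{TC}^0 \subseteq \mathsf{TC}^1 \subseteq \cdots \subseteq \mathsf{TC}^d
  \subseteq \cdots \subseteq \mathsf{NC} = \bigcup_{d \ge 0} \mathsf{TC}^d
  \subseteq \mathsf{P}
```

The abstract's "converge to $\mathsf{NC}$" claim is exactly the union on the right: each fixed looping exponent $d$ captures one level $\mathsf{TC}^d$, and ranging over all polylogarithmic depths exhausts $\mathsf{NC}$.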
Problem

Research questions and friction points this paper is trying to address.

Expanding transformer expressive power without adding parameters
Analyzing padded transformers' convergence to parallelizable problem classes
Exploring padding and looping as parallel alternatives to chain of thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Averaging-hard-attention transformers with polynomial padding
Padding and looping for parallelizable compute
Padded transformers recognize class TC^d with looping