Barriers to Universal Reasoning With Transformers (And How to Overcome Them)

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited ability of Transformers to generalize chain-of-thought (CoT) reasoning to traces longer than those seen during training, which constrains their capacity for universal reasoning. Building on recent theoretical frameworks for Transformer length generalization, we show that, under standard positional encodings and a finite alphabet, Transformers with CoT cannot solve problems beyond TC⁰ once length-generalizable learnability is required, and we identify repeated copying and last-occurrence retrieval as two core obstacles to length generalization. To overcome them, we let the vocabulary grow with problem size, assign each tape position a unique signpost token, and log only value changes, which yields a length-generalizable simulation of Turing machines whose CoT trace length is linear in the simulated runtime. Empirical results indicate that signpost tokens and value-change encodings offer actionable guidance for improving length generalization on hard problems.

📝 Abstract

Chain-of-Thought (CoT) has been shown empirically to improve Transformers' performance and theoretically to increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that, under standard positional encodings and a finite alphabet, Transformers with CoT cannot solve problems beyond $TC^0$; i.e., the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines in which the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token and log only value changes, which enables recovery of the current tape symbol through counts and circumvents both barriers. Further, we empirically show that the use of such signpost tokens and value-change encodings provides actionable guidance for improving length generalization on hard problems.
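
The signpost and value-change idea in the abstract can be illustrated with a small sketch. This is not the paper's construction: it assumes a binary tape alphabet with blank cells equal to 0, and the names `signpost`, `read_cell`, `simulate`, and the token format `<P3>` are hypothetical. Under these assumptions, every write that changes a cell appends that cell's signpost token to the log, and the current symbol is recovered from the parity of the token's count, so the reader neither copies the tape at each step nor searches for the most recent write.

```python
def signpost(pos):
    """Hypothetical unique token for tape position `pos`."""
    return f"<P{pos}>"


def read_cell(log, initial, pos):
    """Recover the current bit at `pos` by counting logged changes.

    Assumption: binary alphabet, so an odd number of changes means
    the cell holds the opposite of its initial bit.
    """
    changes = log.count(signpost(pos))
    return initial.get(pos, 0) ^ (changes % 2)


def simulate(transitions, start, halt, tape_input, max_steps=1000):
    """Run a binary-tape Turing machine, logging only value changes.

    `transitions` maps (state, symbol) -> (new_symbol, move, new_state),
    with move in {-1, 0, +1}. Returns the change log and the initial tape.
    """
    initial = dict(enumerate(tape_input))  # unlisted cells are blank (0)
    log = []
    state, head = start, 0
    for _ in range(max_steps):
        if state == halt:
            break
        symbol = read_cell(log, initial, head)
        new_symbol, move, state = transitions[(state, symbol)]
        if new_symbol != symbol:
            # At most one token per step, so the log never grows
            # faster than the number of simulated steps.
            log.append(signpost(head))
        head += move
    return log, initial


# Example machine (illustrative): increment a binary number stored
# least-significant bit first.
INCREMENT = {
    ("carry", 1): (0, +1, "carry"),
    ("carry", 0): (1, 0, "done"),
}

log, initial = simulate(INCREMENT, "carry", "done", [1, 1, 0, 1])
print(log)                                            # ['<P0>', '<P1>', '<P2>']
print([read_cell(log, initial, p) for p in range(4)])  # [0, 0, 1, 1]
```

In this sketch the input 11 (bits `1, 1, 0, 1`, least significant first) becomes 12, and the log holds three tokens for three changed cells. A larger alphabet would need a different counting scheme than parity, which this example does not attempt.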

Problem

Research questions and friction points this paper is trying to address.

length generalization

Chain-of-Thought

Transformers

Turing completeness

positional encodings

Innovation

Methods, ideas, or system contributions that make the work stand out.

length generalization

Chain-of-Thought

Turing completeness