Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

📅 2025-11-10
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the length generalization capability of Transformer models in chain-of-thought (CoT) reasoning, specifically their ability to generalize short-chain reasoning patterns to longer chains in synthetic state-tracking tasks. Methodologically, we introduce an algebraically structured modeling framework, integrating gradient descent training with recursive self-training to explicitly extend reasoning depth. Theoretically, we establish for the first time that constant-depth Transformers can learn NC¹-complete problems and identify “attention focusing”—a mechanism wherein attention heads concentrate on critical intermediate states—as the key enabler of long-chain generalization. Experimental results demonstrate substantial improvements in robust reasoning performance on unseen chain lengths. Our core contributions are: (i) identifying attention focusing as the fundamental mechanism underlying CoT length generalization; and (ii) providing the first CoT length extrapolation solution with rigorous theoretical guarantees and a scalable, trainable paradigm.

📝 Abstract
The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, significantly going beyond prior art confined to $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails.  Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.
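The synthetic state-tracking setup in the abstract can be made concrete with a toy example. The sketch below is our illustration, not the paper's implementation: composing permutations of the symmetric group S5 word by word is the classic $\mathsf{NC}^1$-complete word problem (by Barrington's theorem), and the chain-of-thought is simply the sequence of intermediate states.

```python
# Illustrative sketch (not the paper's code): a synthetic state-tracking
# task over the symmetric group S5, where the CoT is the run of
# intermediate group elements produced while scanning the input.
import itertools
import random

S5 = list(itertools.permutations(range(5)))  # all 120 permutations of {0..4}

def compose(p, q):
    """Apply permutation p after q: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(5))

def cot_trace(words):
    """State tracking with CoT: emit every intermediate state."""
    state = tuple(range(5))  # identity permutation
    trace = [state]
    for w in words:
        state = compose(w, state)
        trace.append(state)
    return trace  # the last element is the answer; the rest is the CoT

random.seed(0)
words = [random.choice(S5) for _ in range(8)]
trace = cot_trace(words)
print(len(trace))  # 9 states: identity plus one per input word
print(trace[-1])   # final group element, i.e. the task's answer
```

Length generalization then asks whether a model trained on short `words` sequences keeps producing correct traces on much longer ones.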
Problem

Research questions and friction points this paper is trying to address.

Analyzing transformers' ability to generalize reasoning patterns to longer chain-of-thought sequences
Proving how attention concentration enables length generalization in state-tracking problems
Demonstrating transformers can learn complex reasoning tasks beyond previous computational limits
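The "attention concentration" point above can be sketched numerically. The toy calculation below is our construction, not the paper's analysis: if the logit of the critical key beats every distractor by a margin, the softmax weight on that key stays near 1 even as the context grows, which is what makes retrieval robust at longer chain lengths.

```python
# Illustrative sketch (ours, not the paper's): softmax attention
# concentrates on the target key when its logit exceeds all
# n_distractors distractor logits by `margin`.
import math

def attention_weight_on_target(margin, n_distractors):
    """Softmax weight on the target key; distractor logits set to 0."""
    return math.exp(margin) / (math.exp(margin) + n_distractors)

for n in [10, 100, 1000]:
    # a margin growing like log(n) already keeps retrieval robust:
    # with margin = 2*log(n) the target weight is n/(n+1)
    w = attention_weight_on_target(2 * math.log(n), n)
    print(n, round(w, 4))
```

The takeaway is that a logarithmically growing logit margin suffices for the attention weight on the critical intermediate state to approach 1, mirroring the concentration mechanism the paper identifies.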
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformers learn chain-of-thought via attention concentration
Recursive self-training extends reasoning length capability
Proven optimization guarantee for transformers solving NC¹-complete problems
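The recursive self-training idea above can be sketched as a simple loop. The code below is a hedged toy model (the `toy_model` stand-in and all names are ours, not the paper's): a model reliable up to chain length L labels problems of length L+1 by solving the length-L prefix and taking one more step, and retraining on those self-labels extends the reliable length.

```python
# Hedged sketch of recursive self-training on a parity state-tracking
# task (our toy construction, not the paper's scheme).
import random

def toy_model(words, max_len):
    """Stand-in for a trained transformer: tracks parity correctly
    only when the chain fits within its reliable length max_len."""
    if len(words) > max_len:
        return random.choice([0, 1])  # unreliable beyond max_len
    state = 0
    for w in words:
        state ^= w                    # parity state tracking
    return state

def self_train_round(max_len, n_samples=200):
    """Generate length max_len+1 problems and label them with the
    model's own reasoning; retraining on these extends the range."""
    pool = []
    for _ in range(n_samples):
        words = [random.randint(0, 1) for _ in range(max_len + 1)]
        # solve the length-max_len prefix reliably, then one more step
        pred = toy_model(words[:-1], max_len) ^ words[-1]
        pool.append((words, pred))
    return pool, max_len + 1          # reliable length grows by one

random.seed(1)
reach = 4
for _ in range(3):
    data, reach = self_train_round(reach)
print(reach)  # 7: three rounds extend the reliable length from 4 to 7
```

Because each round only asks the model to go one step past what it already solves reliably, every self-generated label is correct, which is the property that lets the scheme bootstrap to longer chains.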
Yu Huang
Department of Statistics and Data Science, Wharton School, University of Pennsylvania
Zixin Wen
Carnegie Mellon University
Machine Learning Theory
Aarti Singh
Machine Learning Department, Carnegie Mellon University
Yuejie Chi
Yale University
data science, generative AI, reinforcement learning, signal processing
Yuxin Chen
Department of Statistics and Data Science, Wharton School, University of Pennsylvania