On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work analyzes when Chain-of-Thought (CoT) reasoning helps and when it hurts, focusing on the trade-off between its benefit and the cost of error accumulation. Within a learning-theoretic framework, CoT is modeled as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively. The first result is a tight canonical decomposition of the reasoning risk into two terms: the oracle-trajectory risk (OTR), which captures the benefit of CoT along the ideal trajectory, and the trajectory-mismatch risk (TMR), which captures the cost of deviating from that trajectory. The analysis shows that if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Under stability, the TMR obeys a tight upper bound governed by an exact amplification factor that separates bounded, linear, and exponential error-growth regimes. Together, these results characterize when CoT is beneficial and identify the factors that control the benefit-cost trade-off.

📝 Abstract

We develop a learning-theoretic framework for understanding Chain of Thought (CoT). We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. We then show that this cost is unavoidable without structure: if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Conversely, under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes. Together, these results give a precise theory of when CoT helps, when it hurts, and what controls the transition between the two.
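
The three error-growth regimes can be illustrated with a minimal sketch. It does not use the paper's amplification factor, which the abstract does not state. It assumes instead a standard one-step recursion e_{t+1} = L * e_t + eps, where `lipschitz` (L) stands for a hypothetical stability constant of the chain rule and `per_step_error` (eps) for a hypothetical per-step answer error. Under that assumption the accumulated error after T steps is eps * (1 + L + ... + L^(T-1)), which stays bounded for L < 1, grows linearly for L = 1, and grows exponentially for L > 1.

```python
def amplification(lipschitz: float, steps: int) -> float:
    """Geometric sum 1 + L + ... + L^(steps-1).

    Illustrative stand-in for an amplification factor; an assumption,
    not the exact quantity derived in the paper.
    """
    if steps <= 0:
        return 0.0
    if lipschitz == 1.0:
        # Every step contributes equally: linear growth.
        return float(steps)
    # Closed form of the geometric series for L != 1.
    return (lipschitz ** steps - 1.0) / (lipschitz - 1.0)


def simulate_error(lipschitz: float, per_step_error: float, steps: int) -> float:
    """Iterate the assumed recursion e_{t+1} = L * e_t + eps from e_0 = 0."""
    err = 0.0
    for _ in range(steps):
        err = lipschitz * err + per_step_error
    return err


if __name__ == "__main__":
    for L in (0.5, 1.0, 2.0):
        factors = [amplification(L, T) for T in (1, 5, 10, 20)]
        print(f"L={L}: {factors}")
```

With L = 0.5 the factor never exceeds 2 regardless of chain length, with L = 1 it equals the number of steps, and with L = 2 it roughly doubles at every step, which is the qualitative picture the abstract describes.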

Problem

Research questions and friction points this paper is trying to address.

Chain of Thought

reasoning risk

error accumulation

stability

domain adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Thought

learning theory

risk decomposition