Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the efficacy of chain-of-thought reasoning in enhancing model performance and its limitations in length generalization, with a particular focus on logical reasoning tasks requiring longer proof sequences. To this end, we introduce PITA, a large-scale propositional logic dataset, and propose quantitative metrics—task depth and task breadth—to systematically evaluate models with and without reasoning traces across diverse task topologies. Experimental results demonstrate that chain-of-thought models significantly outperform baselines on broad but shallow tasks, yet exhibit marked performance degradation on narrow but deep ones. Our findings reveal a fundamental limitation of current reasoning paradigms in deep generalization, supported by an interpretable theoretical framework and validated through cross-benchmark evaluations, including syllogism composition tasks.

📝 Abstract
Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remain incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.
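The abstract's two quantities can be made concrete on a toy syllogism-chaining task. The sketch below is illustrative only (the function names and setup are our own, not the paper's code): depth counts the inference steps a query requires, and breadth counts the unique examples a vocabulary admits.

```python
# Illustrative sketch (not the paper's code): "task depth" and "task breadth"
# on a toy syllogism-chaining task, in the spirit of the abstract's metrics.
from itertools import permutations

def make_chain_example(entities):
    """Build a syllogism chain A->B->C->... and the query 'does A imply the last?'.

    Depth = number of transitive inference steps needed to resolve the query.
    """
    premises = [(a, b) for a, b in zip(entities, entities[1:])]
    query = (entities[0], entities[-1])
    depth = len(premises)  # one chaining step per premise
    return premises, query, depth

def task_breadth(vocab, chain_len):
    """Breadth = number of unique examples: ordered entity choices from vocab."""
    return sum(1 for _ in permutations(vocab, chain_len))

# A depth-3 example over four entities
premises, query, depth = make_chain_example(["A", "B", "C", "D"])
print(depth)                     # 3 inference steps
print(task_breadth("ABCDE", 4))  # 5P4 = 120 unique chains
```

Under this framing, length generalization asks: trained only on chains up to some fixed depth, how accurate is the model on strictly deeper chains?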
Problem

Research questions and friction points this paper is trying to address.

reasoning traces
length generalization
task depth
task breadth
neural reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning traces
length generalization
task depth
task breadth
PITA dataset
William L. Tong
School of Engineering and Applied Sciences, Harvard University; Kempner Institute for the Study of Artificial and Natural Intelligence, Harvard University
Ege Cakar
School of Engineering and Applied Sciences, Harvard University; Kempner Institute for the Study of Artificial and Natural Intelligence, Harvard University
Cengiz Pehlevan
Harvard University
Neural Networks · Theoretical Neuroscience · Machine Learning · Physics of Learning