AI Summary
This work addresses the prevalence of redundant steps in chain-of-thought reasoning generated by language models and introduces the formal notion of a “minimal core”: the smallest subset of reasoning steps sufficient to preserve the original answer or predictive distribution. The authors propose quantitative metrics, including compression ratio, redundancy mass, step necessity, and necessity concentration, and develop a greedy extraction algorithm. They then analyze the resulting cores through representational geometry, intrinsic dimensionality estimation, and cross-model transfer, supported by theoretical guarantees. Empirical evaluation across six reasoning benchmarks shows that, on average, 46% of steps can be removed while preserving the original answer in 86% of cases, and that the top three most necessary steps account for 65% of necessity mass. Moreover, compared with full traces, minimal cores improve separation between correct and incorrect reasoning traces by 11 points, reduce estimated intrinsic dimensionality by 34%, and retain 85% of answers under cross-model transfer.
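To make "greedy extraction" concrete, here is a minimal sketch of greedy step elimination, assuming a hypothetical oracle `answer_fn` that re-runs the model on a subset of steps (original order kept) and returns its final answer. The function name `greedy_minimal_core` and the removal order are illustrative assumptions, not the authors' implementation. The loop repeats until no single remaining step can be dropped, which is the local irreducibility property the summary refers to.

```python
from typing import Callable, Hashable, Sequence


def greedy_minimal_core(
    steps: Sequence[str],
    answer_fn: Callable[[Sequence[str]], Hashable],
) -> list[str]:
    """Greedily delete steps whose removal leaves the final answer unchanged.

    `answer_fn` is a hypothetical oracle: given a (sub)trace, it returns the
    model's final answer. Step order is preserved in every candidate.
    """
    target = answer_fn(steps)  # answer produced by the full trace
    core = list(steps)
    while True:
        for i in range(len(core)):
            candidate = core[:i] + core[i + 1:]
            if answer_fn(candidate) == target:
                core = candidate  # step i was redundant: drop it and rescan
                break
        else:
            # No single step can be removed without changing the answer,
            # so the returned core is locally irreducible.
            return core
```

For example, with a toy oracle that answers "42" only when steps "a" and "c" are both present, `greedy_minimal_core(["a", "b", "c", "d"], oracle)` keeps just those two steps, so half the trace is removable. Rescanning from the start after every deletion costs more oracle calls than a single pass, but a single pass alone would not guarantee that earlier-kept steps remain necessary after later deletions.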
Abstract
Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or the predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish the existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
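The abstract names its metrics without defining them here, so the following is only a sketch under stated assumptions. Step necessity is taken as the leave-one-out drop in the probability of the full-trace answer (clipped at zero), necessity concentration as the share of total necessity mass held by the top k steps (k = 3 matches the "top three steps" statistic), and compression ratio as the fraction of steps removed. The scorer `prob_fn` and all function names are hypothetical; the paper's exact formulas may differ.

```python
from typing import Callable, Sequence


def step_necessity(
    steps: Sequence[str],
    prob_fn: Callable[[Sequence[str]], float],
) -> list[float]:
    """Leave-one-out necessity of each step (assumed definition).

    `prob_fn` is a hypothetical scorer returning the model's probability of
    the full-trace answer given a (sub)trace.
    """
    base = prob_fn(steps)
    return [
        max(0.0, base - prob_fn(list(steps[:i]) + list(steps[i + 1:])))
        for i in range(len(steps))
    ]


def necessity_concentration(necessity: Sequence[float], k: int = 3) -> float:
    """Fraction of total necessity mass carried by the k most necessary steps."""
    total = sum(necessity)
    if total == 0:
        return 0.0  # no step matters: define concentration as zero
    return sum(sorted(necessity, reverse=True)[:k]) / total


def compression_ratio(n_full: int, n_core: int) -> float:
    """Fraction of the full trace's steps removed in the minimal core."""
    return 1.0 - n_core / n_full
```

With a toy additive scorer whose per-step weights are 0.5, 0.125, 0.25, 0.0, and 0.125, each step's necessity equals its weight, and the top three steps carry 0.875 of the necessity mass. Leave-one-out scoring is only one reasonable choice; it ignores interactions between steps, which is one reason greedy elimination and necessity scores can disagree.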