🤖 AI Summary
This work identifies a pervasive performance degradation of chain-of-thought (CoT) prompting in pattern-based in-context learning (ICL), challenging its assumed universal effectiveness. Method: The authors establish an explicit-implicit reasoning duality framework, showing that CoT's enforced explicit stepwise derivation frequently introduces errors, while models' implicit compensation mechanisms, though partially mitigating this bias, undermine both reasoning efficiency and accuracy. Through systematic empirical evaluation across 16 state-of-the-art LLMs and nine pattern-based ICL benchmarks, they conduct controlled ablation studies and attribution diagnostics. Contribution/Results: Results consistently show that CoT underperforms direct answering across model scales and task complexities, and that increasing the number of reasoning steps fails to improve outcomes while adding computational overhead. This study provides the first cognitive-mechanism-level explanation for CoT failure in pattern-based ICL, offering theoretical insights and practical guidance for prompt engineering.
📝 Abstract
Chain-of-Thought (CoT) prompting has been widely recognized for enhancing the reasoning capabilities of large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To investigate this unexpected phenomenon systematically, we designed extensive experiments to test several candidate explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL: explicit reasoning falters because LLMs struggle to infer the underlying patterns from demonstrations, while implicit reasoning, although disrupted by the increased contextual distance that CoT rationales introduce, often compensates and delivers correct answers despite flawed rationales. This duality explains CoT's relative underperformance: noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel at abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions about the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.