🤖 AI Summary
This study systematically investigates the task boundaries of chain-of-thought (CoT) prompting for enhancing large language model (LLM) performance. Through a quantitative meta-analysis and controlled experiments across 14 models and 20 datasets, we find that CoT gains are highly concentrated in mathematical and symbolic reasoning tasks (+12.3% average improvement) yet negligible in commonsense reasoning and language understanding (+0.8%). We propose that CoT's core mechanism is augmenting symbolic execution rather than general-purpose reasoning, and demonstrate that its efficacy is strongly predicted by symbolic cues in the input or output (e.g., equals signs). A planning-execution decoupling analysis further reveals inherent limitations of this computational paradigm. Consequently, we advocate selective CoT activation to balance performance gains against inference cost, and call for new intermediate-computation architectures that integrate explicit symbolic solvers, which we empirically show substantially outperform CoT.
📝 Abstract
Chain-of-thought (CoT) prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" actually helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers that use CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT delivers strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other task types. On MMLU, directly generating the answer without CoT yields nearly identical accuracy to CoT unless the question or the model's response contains an equals sign, which indicates symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning from execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, yet it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
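The selective-activation idea above can be sketched as a simple router that triggers CoT only when a question looks symbolic. This is a minimal illustration under our own assumptions: the function names, the regex of symbolic cues, and the zero-shot CoT trigger phrase are illustrative, not the paper's actual implementation.

```python
import re

# Illustrative pattern for symbolic cues: an equals sign (the signal highlighted
# in the paper's MMLU analysis) plus a few other common math operators.
SYMBOLIC_CUES = re.compile(r"[=+*/^]|\bsqrt\b|\blog\b|\bmod\b")

def should_use_cot(question: str) -> bool:
    """Route to CoT prompting only when the question looks symbolic/mathematical."""
    return bool(SYMBOLIC_CUES.search(question))

def build_prompt(question: str) -> str:
    """Build either a CoT prompt or a cheaper direct-answer prompt."""
    if should_use_cot(question):
        # Zero-shot CoT trigger: spend extra inference tokens on reasoning steps.
        return f"{question}\nLet's think step by step."
    # Direct answering: near-identical accuracy on non-symbolic tasks, lower cost.
    return f"{question}\nAnswer:"
```

In practice the routing predicate could be replaced by a trained classifier, but even a cue-based rule like this captures the paper's observation that the presence of an equals sign strongly predicts when CoT pays off.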