Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking

📅 2025-03-25
🤖 AI Summary
This work investigates the necessity and mechanistic role of chain-of-thought (CoT) prompting in enhancing the mathematical reasoning capabilities of reasoning-oriented large language models (RLLMs). Through systematic comparisons of zero-shot and few-shot CoT across RLLMs with 1.5B–32B parameters—augmented by attention logits analysis, thought-token modeling, and step-wise statistical profiling—we find: (1) CoT substantially improves performance, with larger gains on complex tasks for bigger models and relatively greater benefits for smaller models on simpler tasks; (2) CoT markedly suppresses over-reflection—by up to 90%—and mitigates overfitting to reflection-specific vocabulary; (3) One-shot CoT consistently outperforms few-shot CoT, effectively calibrating the distribution of reasoning steps and reducing redundant inference. Collectively, these results demonstrate that CoT is not merely an effective prompting strategy but a critical intervention for regulating internal reasoning dynamics in RLLMs.

📝 Abstract
Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit an innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?" In this work, we present the first comprehensive analysis of the impact of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that, contrary to concerns, CoT prompting significantly enhances RLLMs' performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distributions of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention-logits analysis reveals RLLMs' overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT. Our findings provide important insights for optimizing RLLMs' performance through appropriate prompting strategies.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the impact of CoT prompting on RLLMs' reasoning performance
Analyzing RLLMs' overfitting to reflection-related vocabulary
Optimizing RLLMs' performance with one-shot CoT strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

CoT prompting significantly enhances RLLMs' reasoning performance
Controls the distribution of thinking tokens and reduces excessive reflections
One-shot CoT consistently outperforms Few-shot CoT
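To make the compared strategies concrete, the sketch below assembles zero-shot and one-shot CoT prompts in the style the paper studies. The exact prompt wording and worked example used by the authors are not given here; the templates and the `build_one_shot_cot` helper are hypothetical illustrations of the general pattern, not the paper's actual prompts.

```python
# Hypothetical prompt templates illustrating Zero-shot vs. One-shot CoT.
# Zero-shot CoT: only the question plus a generic step-by-step cue.
ZERO_SHOT_COT = (
    "Q: {question}\n"
    "A: Let's think step by step."
)

# One-shot CoT prepends a single worked demonstration (an assumed example,
# not one from the paper) before the target question.
ONE_SHOT_EXAMPLE = (
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A: Speed = distance / time = 60 / 1.5 = 40 km/h. The answer is 40.\n"
)

def build_one_shot_cot(question: str) -> str:
    """Assemble a one-shot CoT prompt: one demonstration, then the new question."""
    return ONE_SHOT_EXAMPLE + "\n" + ZERO_SHOT_COT.format(question=question)

if __name__ == "__main__":
    print(build_one_shot_cot("If 3x + 5 = 20, what is x?"))
```

Per the paper's findings, extending `ONE_SHOT_EXAMPLE` with additional demonstrations (few-shot CoT) tends to perform worse for RLLMs than this single-demonstration form.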