From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reasoning-oriented large language models (RLMs) exhibit paradoxical performance degradation under few-shot chain-of-thought (CoT) prompting: even with optimally selected or increasingly numerous in-context examples, accuracy consistently declines. Method: We propose I2S (Insight-to-Solve), a framework that transforms in-context examples into reusable, explicit insights via insight distillation and target-guided reasoning generation, thereby converting otherwise distracting examples into general-purpose reasoning assets. We further introduce I2S+, which integrates high-quality reasoning trajectories, test-time sequential processing, and a self-refinement mechanism to enable dynamic reasoning optimization. Results: On multiple benchmarks, including AIME'25 and GPQA, I2S+ significantly outperforms both direct answering and test-time scaling baselines: GPT-4.1 gains +14.0% accuracy on AIME'25, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA. These results validate the efficacy and cross-model generalizability of our approach.

📝 Abstract
Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
Problem

Research questions and friction points this paper is trying to address.

Addressing performance degradation in reasoning LLMs with few-shot demonstrations
Identifying semantic misguidance and strategy transfer failure as key issues
Proposing Insight-to-Solve framework to convert demonstrations into reusable insights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Insight-to-Solve sequential test-time procedure for reasoning
Converts demonstrations into reusable explicit insights
Self-refines reasoning traces for coherence and correctness
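The three steps above (distill insights, generate a target-specific trace, self-refine) can be sketched as a simple pipeline. This is a hypothetical illustration, not the paper's implementation: `call_model` is a stub standing in for any chat-model API, and all prompt wordings are assumptions.

```python
# Hypothetical sketch of the I2S / I2S+ test-time procedure.
# `call_model` is a stand-in for a real LLM API call; it is stubbed here
# so the pipeline runs end to end without network access.

def call_model(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to an LLM endpoint.
    if "Distill" in prompt:
        return "Insight: decompose the problem before computing."
    if "Refine" in prompt:
        return "Refined reasoning: decompose, compute each part, combine."
    return "Draft reasoning: apply the distilled insight to the target."

def i2s_plus(demos: list[str], target: str, refine_rounds: int = 1) -> str:
    # 1) Insight distillation: turn demonstrations into explicit, reusable insights
    #    instead of feeding raw exemplars that invite verbatim copying.
    insights = call_model(
        "Distill general reasoning insights from these examples:\n" + "\n".join(demos)
    )
    # 2) Target-guided generation: derive a reasoning trace specific to the target.
    trace = call_model(
        f"Using these insights:\n{insights}\nSolve the target question:\n{target}"
    )
    # 3) Optional self-refinement (the '+' in I2S+): polish for coherence/correctness.
    for _ in range(refine_rounds):
        trace = call_model(
            f"Refine this reasoning for coherence and correctness:\n{trace}"
        )
    return trace

result = i2s_plus(["Example: split 2+2 into parts, then add."], "What is 17*23?")
print(result)
```

With `refine_rounds=0` the function reduces to plain I2S; the sequential structure mirrors the paper's claim that insights, not raw exemplars, are what transfer across questions.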