Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

📅 2025-06-26

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Large language models (LLMs) employing “slow-thinking” reasoning struggle to autonomously generate informative, actionable critiques and iteratively refine solutions in long-chain reasoning tasks. Method: This paper introduces the first systematic self-critique fine-tuning framework. It employs supervised fine-tuning on 1,730 human-constructed, high-quality self-critique samples and integrates a multi-round self-assessment and correction mechanism during inference, endowing models with intrinsic reflective capability and closed-loop optimization. Contribution/Results: On the AIME benchmark, the method boosts pass@1 accuracy from 4.4% to 18.2%, substantially improving solution robustness and output verifiability. Its core contribution lies in internalizing structured self-critique as a fundamental reasoning paradigm—establishing a novel framework for verifiable, iterative, and reliable reasoning in LLMs.

Technology Category

Application Category

📝 Abstract

While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment:, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.

Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning of slow-thinking LLMs via self-critique

Improving iterative refinement of prior solutions in LLMs

Boosting performance on reasoning benchmarks through self-critical fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-critical fine-tuning enhances LLM reasoning

Iterative self-critique refines model outputs

Curated dataset enables structured self-improvement

🔎 Similar Papers

No similar papers found.