Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

📅 2025-06-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) employing “slow-thinking” reasoning struggle to autonomously generate informative, actionable critiques and iteratively refine solutions in long-chain reasoning tasks. Method: This paper introduces the first systematic self-critique fine-tuning framework. It employs supervised fine-tuning on 1,730 human-constructed, high-quality self-critique samples and integrates a multi-round self-assessment and correction mechanism during inference, endowing models with intrinsic reflective capability and closed-loop optimization. Contribution/Results: On the AIME benchmark, the method boosts pass@1 accuracy from 4.4% to 18.2%, substantially improving solution robustness and output verifiability. Its core contribution lies in internalizing structured self-critique as a fundamental reasoning paradigm—establishing a novel framework for verifiable, iterative, and reliable reasoning in LLMs.

Technology Category

Application Category

📝 Abstract
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment:, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning of slow-thinking LLMs via self-critique
Improving iterative refinement of prior solutions in LLMs
Boosting performance on reasoning benchmarks through self-critical fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-critical fine-tuning enhances LLM reasoning
Iterative self-critique refines model outputs
Curated dataset enables structured self-improvement
🔎 Similar Papers
No similar papers found.
X
Xin Xu
The Hong Kong University of Science and Technology
Tianhao Chen
Tianhao Chen
Phd student, Zhejiang University
Geotechnical engineering
F
Fan Zhang
The Hong Kong University of Science and Technology
Wanlong Liu
Wanlong Liu
University of Electronic Science and Technology of China
LLM ReasoningRAGMedical LLMInformation Extraction
Pengxiang Li
Pengxiang Li
Beijing Institute of Technology
Multimodal AgentVision and Language3DVHyperbolic Learning
A
Ajay Kumar Jaiswal
University of Texas at Austin
Y
Yuchen Yan
Zhejiang University
J
Jishan Hu
The Hong Kong University of Science and Technology
Y
Yang Wang
The Hong Kong University of Science and Technology
H
Hao Chen
The Hong Kong University of Science and Technology
S
Shiwei Liu
University of Oxford
Shizhe Diao
Shizhe Diao
NVIDIA Research
Large Language ModelsNatural Language Processing
Can Yang
Can Yang
Hong Kong University of Science and Technology
Statistical Machine LearningStatistical Genetics and Genomics
L
Lu Yin
University of Surrey