RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing supervised fine-tuning methods struggle to enhance large language models' deep critical reasoning capabilities; the resulting critiques lack reflectiveness and verifiability, hindering effective optimization. To address this, we propose RefCritic, a framework that integrates reinforcement learning (RL) into long chain-of-thought critique generation. RefCritic introduces a dual rule-based reward mechanism that jointly optimizes for solution-judgment correctness and the refinement efficacy of the policy model, enabling deep, actionable evaluation. Evaluated on Qwen and DeepSeek series models across five benchmarks, RefCritic achieves significant gains: +6.8%/+7.2% on AIME25, outperforms step-level supervised approaches on ProcessBench, and demonstrates strong scalability under majority voting. Our core contribution lies in moving beyond the limitations of supervised paradigms by establishing an RL-driven critique framework with reflective and verifiable reasoning capabilities.

📝 Abstract
With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we first demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models' critique abilities, producing superficial critiques with insufficient reflection and verification. To unlock deeper critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracy of the policy model based on the critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. Under critique-and-refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., 6.8% and 7.2% gains on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling with increased voting numbers. Moreover, despite training on solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark for identifying erroneous steps in mathematical reasoning.
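The dual rule-based reward described in the abstract can be sketched as a weighted combination of a judgment-correctness term and a refinement-success term. The function below is an illustrative reconstruction, not code from the paper; the signature, weights, and the equality check on answers are all assumptions for exposition.

```python
def dual_rule_reward(critic_verdict, solution_is_correct,
                     refined_answers, gold_answer,
                     w_judge=0.5, w_refine=0.5):
    """Illustrative dual rule-based reward (hypothetical weights/signature).

    critic_verdict: bool, the critic's judgment of the solution.
    solution_is_correct: bool, ground-truth label for the solution.
    refined_answers: answers the policy model produced after refining
                     with the critique.
    gold_answer: the reference answer for the problem.
    """
    # Reward (1): instance-level correctness of the critic's judgment.
    r_judge = 1.0 if critic_verdict == solution_is_correct else 0.0

    # Reward (2): fraction of critique-guided refinements that reach
    # the gold answer (refinement accuracy of the policy model).
    if refined_answers:
        r_refine = sum(a == gold_answer for a in refined_answers) / len(refined_answers)
    else:
        r_refine = 0.0

    return w_judge * r_judge + w_refine * r_refine
```

In an RL loop, this scalar would serve as the return for the critic's sampled long chain-of-thought critique, tying the critique's quality directly to whether it actually helps the policy model fix its solution.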
Problem

Research questions and friction points this paper is trying to address.

Enhancing critique abilities in Large Language Models
Overcoming superficial critiques with reinforcement learning
Generating actionable feedback for model refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with dual rule-based rewards
Long-chain-of-thought critic module RefCritic
Instance-level correctness and refinement accuracies
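The abstract's majority-voting result (policy models "filtered by RefCritic" scale better with more votes) suggests a simple pipeline: discard candidate solutions the critic rejects, then vote over the survivors. A minimal sketch under that assumption follows; the fallback to the unfiltered pool when the critic rejects everything is my own design choice, not specified in the paper.

```python
from collections import Counter

def critique_filtered_vote(candidate_answers, critic_accepts):
    """Majority vote over candidates that survive the critic's filter.

    candidate_answers: sampled answers from the policy model.
    critic_accepts: per-candidate bool verdicts from the critic model.
    """
    kept = [a for a, ok in zip(candidate_answers, critic_accepts) if ok]
    # Assumed fallback: if the critic rejects every candidate,
    # vote over the full pool rather than abstain.
    pool = kept if kept else candidate_answers
    return Counter(pool).most_common(1)[0][0]
```

Because a good critic removes wrong answers before the vote, accuracy can keep improving as the number of sampled candidates grows, which is consistent with the scaling behavior the paper reports.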