DeepCritic: Deliberate Critique with Large Language Models

📅 2025-05-01

🤖 AI Summary

Problem: Existing large language model (LLM) critics are weak at step-level critique of mathematical reasoning. Their feedback is superficial and gives the generator too little guidance to correct mistakes. Method: We propose a two-stage framework. (1) Qwen2.5-72B-Instruct generates 4.5K long-form, step-wise seed critiques for supervised fine-tuning, each combining multi-perspective verification with an in-depth critique of the initial critique. (2) The fine-tuned model is further trained with reinforcement learning (RL) on either human-labeled data from PRM800K or data annotated automatically via Monte Carlo sampling-based correctness estimation. The resulting critic is built on Qwen2.5-7B-Instruct. Results: The critic significantly outperforms existing LLM critics, including same-sized DeepSeek-R1-distill models and GPT-4o, on various error-identification benchmarks, and its more detailed feedback helps the LLM generator refine erroneous steps more effectively.

📝 Abstract

As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, which leads to low judgment accuracy and gives the LLM generator insufficient feedback to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
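
The Monte Carlo sampling-based correctness estimation mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a step is treated as erroneous when no sampled continuation from the prefix ending at that step reaches the reference answer. The names `Rollout`, `prefix_success_rate`, `first_error_step`, `n_samples`, and `threshold` are hypothetical, and the toy generator stands in for a real LLM sampler.

```python
from typing import Callable, List, Optional

# Hypothetical interface: given a problem and a solution prefix, sample one
# continuation and return the final answer it reaches.
Rollout = Callable[[str, List[str]], str]


def prefix_success_rate(problem: str, prefix: List[str], rollout: Rollout,
                        gold_answer: str, n_samples: int = 8) -> float:
    """Fraction of sampled continuations from `prefix` that reach `gold_answer`."""
    hits = sum(rollout(problem, prefix) == gold_answer for _ in range(n_samples))
    return hits / n_samples


def first_error_step(problem: str, steps: List[str], rollout: Rollout,
                     gold_answer: str, n_samples: int = 8,
                     threshold: float = 0.0) -> Optional[int]:
    """Return the 0-based index of the first step whose prefix success rate is
    at or below `threshold` (an assumed labeling rule), or None if no step is."""
    for i in range(len(steps)):
        rate = prefix_success_rate(problem, steps[: i + 1], rollout,
                                   gold_answer, n_samples)
        if rate <= threshold:
            return i
    return None


# Toy stand-in for an LLM generator: once the prefix contains the faulty
# product "= 19", every continuation ends at the wrong answer.
def toy_rollout(problem: str, prefix: List[str]) -> str:
    return "12" if any("= 19" in step for step in prefix) else "11"


PROBLEM = "Compute (2 + 4) * 3 - 7."
GOLD = "11"
flawed = ["2 + 4 = 6", "6 * 3 = 19", "19 - 7 = 12"]
correct = ["2 + 4 = 6", "6 * 3 = 18", "18 - 7 = 11"]

flawed_label = first_error_step(PROBLEM, flawed, toy_rollout, GOLD)
correct_label = first_error_step(PROBLEM, correct, toy_rollout, GOLD)
print(flawed_label, correct_label)
```

With a real sampler the success rate is noisy, so the choice of `n_samples` and `threshold` trades annotation cost against label reliability.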

Problem

Research questions and friction points this paper is trying to address.

Enhancing math critique ability of LLMs

Addressing shallow critiques in LLM feedback

Developing deliberate step-wise critique frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework enhances math critique ability

Supervised fine-tuning with multi-perspective verification critiques

Reinforcement learning boosts critique accuracy and detail
