Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the bottleneck in critic model training—its reliance on strong supervised signals—by proposing an online reinforcement learning (RL) framework that eliminates the need for stronger external supervisors. Methodologically, it first identifies and analyzes the degradation of critic discriminability when relying solely on indirect rewards; it then introduces a two-stage RL strategy: (1) prioritizing discriminability enhancement, followed by (2) joint optimization of helpfulness and discriminability. The framework incorporates rule-based direct rewards, indirect feedback rewards, and regularization to enable co-optimization of critic and generator models. Experiments across diverse tasks and models demonstrate consistent improvements: Qwen2.5-7B achieves +9.02% accuracy on in-domain reasoning tasks and +5.70% on cross-domain tasks, validating the effectiveness and generalizability of the approach.

📝 Abstract
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
Problem

Research questions and friction points this paper is trying to address.

Training critiquing models without stronger supervision
Improving critic discriminability and helpfulness simultaneously
Enhancing complex reasoning via two-stage reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage reinforcement learning for critiquing models
Direct rule-based rewards enhance critic discriminability
Indirect actor refinement rewards improve critic helpfulness
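The two-stage strategy above can be sketched as a toy reward function. This is an illustrative assumption only: the function names, the 0/1 reward scale, and the regularization weight are invented here and may differ from the paper's exact formulation.

```python
# Hypothetical sketch of Critique-RL's two-stage reward shaping.
# All names and constants are illustrative, not taken from the paper.

def stage1_reward(critic_verdict: bool, response_is_correct: bool) -> float:
    """Stage I: direct rule-based reward. The critic is rewarded for
    correctly judging whether the actor's response is high-quality."""
    return 1.0 if critic_verdict == response_is_correct else 0.0


def stage2_reward(critic_verdict: bool,
                  response_is_correct: bool,
                  refined_is_correct: bool,
                  reg_weight: float = 0.5) -> float:
    """Stage II: indirect reward from the actor's refinement outcome
    (helpfulness), regularized by the Stage-I direct signal so the
    critic's discriminability does not degrade."""
    indirect = 1.0 if refined_is_correct else 0.0
    direct = stage1_reward(critic_verdict, response_is_correct)
    return indirect + reg_weight * direct


# Example: the critic correctly flags a wrong response, and the actor's
# refinement fixes it -- both reward components fire.
print(stage2_reward(critic_verdict=False,
                    response_is_correct=False,
                    refined_is_correct=True))  # 1.5
```

In this toy form, Stage I trains only against `stage1_reward`, and Stage II switches to `stage2_reward`, mirroring the paper's "discriminability first, then joint helpfulness + discriminability" schedule.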
Zhiheng Xi
Fudan University
LLM Reasoning, LLM-based Agents
Jixuan Huang
Fudan University
Xin Guo
Fudan University
Boyang Hong
Fudan University
Dingwen Yang
Fudan University
Xiaoran Fan
Fudan University
Shuo Li
Fudan University
Zehui Chen
USTC
Junjie Ye
Fudan University
Siyu Yuan
Fudan University
Zhengyin Du
ByteDance Seed
Large Language Model, Multi-modal Learning
Xuesong Yao
Master of Mechanics, Peking University
Machine Learning, Large Language Model
Yufei Xu
ByteDance Seed
Jiecao Chen
ByteDance Seed
LLM, Reasoning, Agent, Tool Use, Memory
Rui Zheng
Fudan University
Tao Gui
Fudan University
Qi Zhang
Fudan University
Xuanjing Huang
Fudan University