ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

📅 2026-05-13

🤖 AI Summary
Large language model agents can correct errors when given external critiques, but they often fail to internalize that guidance, and a frozen critic cannot improve its feedback over time. This work proposes ICRL, a framework that uses reinforcement learning to jointly train a solver and a critic on a shared backbone so that critique-induced success becomes unassisted solver ability. The method introduces a distribution-calibration reweighting ratio to bridge the shift between critique-conditioned and critique-free behavior, and a role-wise group advantage estimate to stabilize joint optimization. With Qwen3-4B and Qwen3-8B backbones, ICRL outperforms GRPO by an average of 6.4 points on agentic tasks and 7.0 points on mathematical reasoning. The learned 8B critic is comparable to 32B critics while using substantially fewer tokens.
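As a rough illustration of the role-grouped (role-wise) advantage idea mentioned in the summary, the sketch below normalizes each reward only against other rewards from the same role, in the spirit of GRPO-style group normalization. The function name `role_grouped_advantages`, the `(role, reward)` input format, and the mean/standard-deviation normalization are all assumptions made for illustration; the summary does not specify the paper's actual estimator.

```python
from collections import defaultdict
from statistics import mean, pstdev


def role_grouped_advantages(samples, eps=1e-6):
    """Normalize each reward against the other rewards of the same role.

    samples: list of (role, reward) pairs from one rollout group.
    Returns one advantage per sample, in input order.
    NOTE: hypothetical helper; not taken from the ICRL code base.
    """
    rewards_by_role = defaultdict(list)
    for role, reward in samples:
        rewards_by_role[role].append(reward)

    # Per-role baseline (mean) and scale (population standard deviation).
    stats = {
        role: (mean(rewards), pstdev(rewards))
        for role, rewards in rewards_by_role.items()
    }
    return [
        (reward - stats[role][0]) / (stats[role][1] + eps)
        for role, reward in samples
    ]


# Toy rewards: solver rewards and critic rewards live on different scales.
rollouts = [("solver", 1.0), ("solver", 0.0), ("critic", 0.5), ("critic", 0.1)]
advantages = role_grouped_advantages(rollouts)
```

Keeping separate baselines per role means a critic reward is never compared against a solver reward, which is one plausible way such a scheme could reduce variance when both roles share a backbone.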
📝 Abstract
Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
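The two reward-side mechanisms described in the abstract can be pictured with a small sketch. It assumes, purely for illustration, that the critic's reward is the difference in solver success rate with and without the critique, and that the re-weighting ratio is a clipped sequence-level likelihood ratio between the critique-free and critique-conditioned prompts. Neither formula is given in the abstract, and the names `critic_reward` and `calibration_weight` are hypothetical.

```python
import math
from statistics import mean


def critic_reward(success_with_critique, success_without_critique):
    """Solver success-rate gain attributed to a critique (assumed form).

    Each argument is a list of 0/1 outcomes from solver rollouts.
    """
    return mean(success_with_critique) - mean(success_without_critique)


def calibration_weight(logp_without_critique, logp_with_critique, clip=1.0):
    """Clipped likelihood ratio pi(y | x) / pi(y | x, critique) (assumed form).

    Both arguments are sequence log-probabilities of the same response y.
    A response that is unlikely without the critique gets a small weight,
    so it contributes less to the critique-free solver update.
    """
    return min(math.exp(logp_without_critique - logp_with_critique), clip)


# The critique lifts success from 1/4 to 3/4 of rollouts.
gain = critic_reward([1, 1, 0, 1], [0, 1, 0, 0])

# Response is far less likely without the critique: down-weighted.
low_weight = calibration_weight(-12.0, -10.0)

# Response is already likely without the critique: weight capped at clip.
capped_weight = calibration_weight(-9.0, -10.0)
```

Under these assumptions, a critique earns positive reward only when it actually changes solver outcomes, and critique-guided responses transfer to the solver in proportion to how plausible they are under the solver's own critique-free prompt.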
Problem

Research questions and friction points this paper is trying to address.

self-critique
internalization
reinforcement learning
language agents
iterative improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

internalization
self-critique
reinforcement learning
distribution calibration
joint optimization