Teaching Language Models to Critique via Reinforcement Learning

📅 2025-02-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited ability of large language models (LLMs) to autonomously critique and iteratively refine generated code. To this end, we propose CTRL, a reinforcement learning framework that trains critique models without human annotations. Methodologically, we introduce a generative reward model capable of producing fine-grained, transferable feedback, and design a two-stage "critique–revision" iterative reasoning mechanism to mitigate error accumulation. Our key contribution is the first unsupervised critique-modeling paradigm, enabling models to generate high-precision critiques while supporting cross-task reward modeling. Experiments across multiple code generation benchmarks demonstrate substantial improvements in pass rates, achieving up to a 106.1% relative performance gain over baseline methods.

๐Ÿ“ Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $ exttt{CTRL}$, a framework for $ exttt{C}$ritic $ exttt{T}$raining via $ exttt{R}$einforcement $ exttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $ exttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.
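The iterative critique–revision procedure described above can be sketched as a simple loop: a fixed generator proposes code, a trained critic returns a verdict plus actionable feedback, and the generator revises until the critic is satisfied or a round budget is exhausted. The sketch below is a hedged illustration, not the paper's implementation; `generate`, `critique`, and `max_rounds` are hypothetical stand-ins for the generator model, the CTRL-trained critic, and the test-time scaling budget.

```python
# Minimal sketch of test-time critique-revision scaling (assumed interface,
# not the authors' code). The generator and critic are placeholder functions
# standing in for LLM calls.

def generate(problem, feedback=None):
    # Placeholder generator: in practice, an LLM conditioned on the
    # problem and, on revision rounds, on the critic's feedback.
    return f"solution({problem}, hint={feedback})"

def critique(problem, solution):
    # Placeholder critic: in practice, the CTRL-trained model returning
    # a correctness judgment and fine-grained feedback text.
    return False, "check edge cases"

def critique_revision(problem, max_rounds=3):
    """Iteratively revise a candidate solution using critic feedback."""
    solution = generate(problem)
    for _ in range(max_rounds):
        ok, feedback = critique(problem, solution)
        if ok:
            break  # critic accepts; stop early
        solution = generate(problem, feedback=feedback)
    return solution
```

Because the critic is trained to maximize the generator's correction performance rather than to imitate human critiques, the same loop can scale at test time: more rounds mean more chances to repair compounding errors.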
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM self-critique via reinforcement learning
Improve code generation without human supervision
Scale iterative critique-revision for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning trains critics
Critics enhance code generation
Iterative critique-revision scales performance