Teaching Language Models to Critique via Reinforcement Learning

📅 2025-02-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited ability of large language models (LLMs) to autonomously critique and iteratively refine generated code. To this end, we propose CTRL, a reinforcement learning framework that trains critique models without human annotations. Methodologically, we introduce a generative reward model capable of producing fine-grained, transferable feedback, and design a two-stage "critique–revision" iterative reasoning mechanism to mitigate error accumulation. Our key contribution is the first unsupervised critique-modeling paradigm, enabling models to generate high-precision critiques while supporting cross-task reward modeling. Experiments across multiple code generation benchmarks demonstrate substantial improvements in pass rates, achieving up to a 106.1% relative performance gain over baseline methods.

๐Ÿ“ Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $ exttt{CTRL}$, a framework for $ exttt{C}$ritic $ exttt{T}$raining via $ exttt{R}$einforcement $ exttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $ exttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.
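The iterative critique–revision procedure described above can be sketched as a simple loop: a fixed generator proposes code, a trained critic returns a verdict plus actionable feedback, and the generator revises until the critic is satisfied or a round budget is exhausted. The sketch below is a hedged illustration, not the paper's implementation; `generate`, `critique`, and `max_rounds` are hypothetical stand-ins for the generator model, the CTRL-trained critic, and the test-time scaling budget.

```python
# Minimal sketch of test-time critique-revision scaling (assumed interface,
# not the authors' code). The generator and critic are placeholder functions
# standing in for LLM calls.

def generate(problem, feedback=None):
    # Placeholder generator: in practice, an LLM conditioned on the
    # problem and, on revision rounds, on the critic's feedback.
    return f"solution({problem}, hint={feedback})"

def critique(problem, solution):
    # Placeholder critic: in practice, the CTRL-trained model returning
    # a correctness judgment and fine-grained feedback text.
    return False, "check edge cases"

def critique_revision(problem, max_rounds=3):
    """Iteratively revise a candidate solution using critic feedback."""
    solution = generate(problem)
    for _ in range(max_rounds):
        ok, feedback = critique(problem, solution)
        if ok:
            break  # critic accepts; stop early
        solution = generate(problem, feedback=feedback)
    return solution
```

Because the critic is trained to maximize the generator's correction performance rather than to imitate human critiques, the same loop can scale at test time: more rounds mean more chances to repair compounding errors.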
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM self-critique via reinforcement learning
Improve code generation without human supervision
Scale iterative critique-revision for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning trains critics
Critics enhance code generation
Iterative critique-revision scales performance