Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) paradigms for code generation focus solely on producing responses and lack mechanisms for critical self-reflection on solution correctness. Method: The paper proposes Critique Reinforcement Learning (CRL), a framework that explicitly models critique, i.e., a binary correctness judgment of (question, solution) pairs, as a learnable RL reward signal. CRL is combined with standard RL in a hybrid training paradigm, endowing models with both code generation and self-verification capabilities. Contribution/Results: Critique-Coder-8B reaches over 60% accuracy on LiveCodeBench (v5), outperforming DeepCoder-14B and GPT-o1; it also improves on logic reasoning tasks from the BBEH benchmark, indicating strong generalization and cross-task transfer. The core idea is to internalize critique as an optimized reinforcement signal, shifting code models from merely generating correct outputs toward also understanding why outputs are correct.

📝 Abstract
Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by these, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that applying CRL to coding datasets enhances general reasoning and critique abilities, which transfer across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
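Concretely, the CRL reward is binary: 1 if the critique's final judgment matches the ground-truth label, 0 otherwise. Below is a minimal sketch of such a reward function, assuming the model ends its critique with an explicit "Judgment: True/False" line; the output format, parsing logic, and function name are illustrative assumptions, not the authors' implementation.

```python
import re

def crl_reward(critique: str, ground_truth: bool) -> float:
    """Binary CRL reward: 1.0 if the critique's final True/False
    judgment matches the ground-truth correctness label, else 0.0."""
    # Hypothetical format: the critique ends with "Judgment: True" or
    # "Judgment: False"; real prompts may use a different convention.
    match = re.search(r"Judgment:\s*(True|False)\s*$", critique.strip(),
                      re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable critiques earn no reward
    predicted = match.group(1).lower() == "true"
    return 1.0 if predicted == ground_truth else 0.0
```

For example, `crl_reward("The loop bound is off by one ... Judgment: False", ground_truth=False)` returns 1.0, while a mismatched or missing judgment returns 0.0.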
Problem

Research questions and friction points this paper is trying to address.

Standard RL for code generation rewards only response production and lacks mechanisms for critique or self-reflection on solution correctness
How to combine code generation and self-verification within a single hybrid RL-CRL training recipe
Whether critique abilities learned on coding data transfer to broader reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Critique Reinforcement Learning (CRL), in which the model generates a critique of a (question, solution) pair and is rewarded only when its final True/False judgment matches the ground truth
Trains Critique-Coder on a hybrid of standard RL and CRL by substituting 20% of the RL data with CRL data (see the sketch after this list)
Improves both code generation (LiveCodeBench) and general logic reasoning (BBEH), demonstrating cross-task transfer
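The hybrid training data is a simple substitution: 20% of the standard RL examples are replaced with CRL (critique) examples. Here is a minimal sketch of one way to build such a mixture, assuming both example lists are already formatted for their respective objectives; all names here are hypothetical, not the paper's code.

```python
import random

def build_hybrid_data(rl_examples: list, crl_examples: list,
                      crl_fraction: float = 0.2, seed: int = 0) -> list:
    """Replace crl_fraction of the RL examples with CRL examples,
    mirroring the 20% substitution described in the abstract."""
    rng = random.Random(seed)
    n_crl = int(len(rl_examples) * crl_fraction)
    kept_rl = rng.sample(rl_examples, len(rl_examples) - n_crl)
    mixed = kept_rl + rng.sample(crl_examples, min(n_crl, len(crl_examples)))
    rng.shuffle(mixed)  # interleave the two objectives within the data
    return mixed
```

Substitution (rather than augmentation) keeps the total training budget fixed, so any gains come from the critique objective itself rather than from extra data.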