Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in reinforcement learning (RL) with purely numerical feedback for large language models (LLMs), namely performance plateaus, limited effectiveness of self-reflection, and persistent failures, this paper proposes Critique-GRPO, an online RL framework that integrates natural language critiques with scalar rewards. Methodologically, it extends GRPO so that the policy learns simultaneously from its initial responses and from critique-guided refinements while maintaining exploration. Key contributions include: (i) the integration of interpretable natural language critiques into online RL policy optimization; and (ii) the empirical finding that neither higher entropy nor longer responses guarantees effective exploration. Evaluated on eight challenging mathematical, STEM, and general reasoning benchmarks with Qwen2.5-7B-Base and Qwen3-8B-Base, the approach improves average pass@1 by approximately 4.5% and 5%, respectively, over strong baselines, including supervised fine-tuning, scalar-only RL, and a method that augments online RL with expert demonstrations.
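As a concrete illustration, below is a minimal Python sketch of what one Critique-GRPO-style training step could look like, assuming a simple generate/score/critique/refine interface. All function bodies are hypothetical stubs (the paper's actual critique model, reward function, and update rule are not reproduced here); only the overall flow, sampling a group, critiquing failures, refining, and computing group-relative advantages over the combined pool, follows the description above.

```python
import random
import statistics

def generate(policy, prompt, n):
    """Sample n candidate responses from the policy (stub)."""
    return [f"[{policy}] attempt {i} at: {prompt}" for i in range(n)]

def scalar_reward(response):
    """Numerical feedback, e.g. 1.0 if the final answer checks out (stub)."""
    return random.choice([0.0, 1.0])

def critique(response):
    """Natural language feedback locating the flaw (stub)."""
    return f"The algebra step in '{response}' is wrong; redo it."

def refine(policy, prompt, response, feedback):
    """Regenerate a solution conditioned on the critique (stub)."""
    return f"[{policy}] refinement of '{response}' using: {feedback}"

def group_advantages(rewards):
    """GRPO-style group-relative advantages: (r - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

def critique_grpo_step(policy, prompt, group_size=4):
    # 1) Sample a group of initial responses; score them with scalar rewards.
    responses = generate(policy, prompt, group_size)
    rewards = [scalar_reward(r) for r in responses]

    # 2) For failed responses, obtain a critique and a critique-guided
    #    refinement, then score the refinement with the same scalar reward.
    refinements = []
    for resp, rew in zip(responses, rewards):
        if rew == 0.0:
            refinements.append(refine(policy, prompt, resp, critique(resp)))
    ref_rewards = [scalar_reward(r) for r in refinements]

    # 3) Compute advantages over the combined pool so the policy learns from
    #    initial responses and refinements simultaneously.
    samples = responses + refinements
    advantages = group_advantages(rewards + ref_rewards)
    return list(zip(samples, advantages))  # consumed by the policy-gradient update

if __name__ == "__main__":
    for sample, adv in critique_grpo_step("policy-v0", "Solve 2x + 3 = 11"):
        print(f"{adv:+.2f}  {sample}")
```

How the paper actually weights refinement samples relative to initial responses, and how it keeps exploration controlled, are design details of Critique-GRPO that this stub does not capture.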

📝 Abstract
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.
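For reference, the group-relative advantage that GRPO (the algorithm Critique-GRPO extends) uses in place of a learned value baseline has the standard form below. This is the common formulation from the GRPO literature, not an equation quoted from this paper, and how critique-guided refinements are folded into the group is left unspecified here.

```latex
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}
                     {\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},
\qquad
J(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(\rho_i\,\hat{A}_i,\;
\operatorname{clip}\!\big(\rho_i,\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}
```

(Standard GRPO also subtracts a KL-divergence penalty against a reference policy, omitted here for brevity.)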
Problem

Research questions and friction points this paper is trying to address.

Overcoming performance plateaus in RL-finetuned LLMs
Enhancing self-reflection with natural language feedback
Addressing persistent failures in complex reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates natural language and numerical feedback
Online RL framework for policy optimization
Enhances LLM reasoning with critique-guided refinements (see the prompt sketch below)
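The sketch below illustrates one plausible way natural language feedback enters the pipeline: the policy is re-prompted with its own failed attempt plus the critique to produce a refinement. The template wording is hypothetical, not the prompt used in the paper.

```python
# Hypothetical critique-guided refinement prompt; the paper's exact
# template is not reproduced here.
REFINE_TEMPLATE = """Problem:
{problem}

Your previous attempt:
{attempt}

Critique of that attempt:
{critique}

Please write a corrected, complete solution."""

def build_refinement_prompt(problem, attempt, critique):
    """Condition the policy on its own failed attempt plus the critique."""
    return REFINE_TEMPLATE.format(problem=problem, attempt=attempt, critique=critique)

print(build_refinement_prompt(
    "Solve: 2x + 3 = 11",
    "x = 7",
    "Subtracting 3 gives 2x = 8, so dividing by 2 yields x = 4, not 7.",
))
```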
👥 Authors
Xiaoying Zhang (Bytedance Inc.)
Hao Sun (University of Cambridge)
Yipeng Zhang (Tsinghua University)
Kaituo Feng (MMLab, CUHK) · Multimodal LLMs, Machine Learning
Chaochao Lu (Shanghai AI Laboratory) · Causal AI
Chao Yang (Shanghai Artificial Intelligence Laboratory)
Helen Meng (The Chinese University of Hong Kong, HCCL)