Self-Generated Critiques Boost Reward Modeling for Language Models

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing reward models output only scalar scores and lack interpretable natural-language critiques, which limits their usefulness for aligning LLMs with human preferences in RLHF. Method: Critic-RM is a framework that jointly models textual critiques and scalar rewards using self-generated critiques, without extra critique supervision. It adopts a two-stage paradigm: (i) the model generates candidate critiques, which are filtered for quality; (ii) the model is then jointly fine-tuned on reward prediction and critique generation. Contribution/Results: Across multiple benchmarks, Critic-RM improves reward prediction accuracy by 3.7%-7.3% over standard reward models and LLM judges. Its generated critiques also localize and correct flawed reasoning steps, yielding 2.5%-3.2% gains in downstream reasoning accuracy.
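To make the two-stage recipe concrete, here is a minimal sketch of stage (i), assuming a hypothetical `llm_generate` text-generation callable and a simple consistency filter that keeps critiques whose verdict agrees with the human preference label. It is an illustration of the idea, not the authors' implementation.

```python
# Stage 1 sketch: self-generate critiques for preference pairs and filter them.
# `llm_generate` is a placeholder for any prompt-in/text-out LLM call (hypothetical).

def generate_critique(llm_generate, prompt: str, response: str) -> tuple[str, bool]:
    """Ask the LLM for a critique plus a final GOOD/BAD verdict on the response."""
    critique = llm_generate(
        f"Question: {prompt}\nResponse: {response}\n"
        "Critique this response, then answer GOOD or BAD on the last line."
    )
    lines = critique.strip().splitlines()
    verdict = bool(lines) and lines[-1].upper().startswith("GOOD")
    return critique, verdict

def build_critique_dataset(llm_generate, preference_pairs):
    """preference_pairs: iterable of (prompt, chosen, rejected) triples."""
    kept = []
    for prompt, chosen, rejected in preference_pairs:
        crit_chosen, ok_chosen = generate_critique(llm_generate, prompt, chosen)
        crit_rejected, ok_rejected = generate_critique(llm_generate, prompt, rejected)
        # Quality filter (assumed here): keep only critiques whose verdicts
        # are consistent with the known preference label.
        if ok_chosen and not ok_rejected:
            kept.append({
                "prompt": prompt,
                "chosen": chosen, "critique_chosen": crit_chosen,
                "rejected": rejected, "critique_rejected": crit_rejected,
            })
    return kept
```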

📝 Abstract
Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.
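The joint fine-tuning step described in the abstract can be pictured as a simple multi-task loss. The sketch below, assuming a Bradley-Terry-style preference loss on the scalar reward head plus a next-token loss on the filtered critique tokens with a weighting factor `lambda_critique`, is one plausible reading of the objective rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(reward_chosen, reward_rejected, critique_logits, critique_labels,
               lambda_critique: float = 1.0):
    """Multi-task objective: preference (reward) loss + critique generation loss.

    reward_chosen, reward_rejected: scalar rewards per example, shape (batch,)
    critique_logits: (batch, seq_len, vocab) logits over critique tokens
    critique_labels: (batch, seq_len) token ids, with -100 marking positions to ignore
    lambda_critique: assumed weighting between the two tasks (hypothetical name)
    """
    # Bradley-Terry style term: push the chosen response's reward above the rejected one's.
    pref_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Language-modeling term on the self-generated, filtered critiques.
    lm_loss = F.cross_entropy(
        critique_logits.reshape(-1, critique_logits.size(-1)),
        critique_labels.reshape(-1),
        ignore_index=-100,
    )
    return pref_loss + lambda_critique * lm_loss
```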
Problem

Research questions and friction points this paper is trying to address.

Improving reward modeling accuracy
Incorporating natural language critiques
Enhancing reasoning accuracy via critiques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-generated critiques improve reward models
Two-stage process for critique generation and filtering
Joint fine-tuning improves both reward prediction and critique quality (an illustrative usage sketch follows this list)
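As a rough picture of how such a model is used once trained: for each candidate response it first writes a critique and then emits a scalar reward conditioned on that critique, so every score comes with a readable explanation. The `model.generate_critique` and `model.score` methods below are assumed stand-ins, not an interface defined by the paper.

```python
def rank_responses(model, prompt: str, responses: list[str]):
    """Critique-then-score ranking sketch.

    `model.generate_critique` and `model.score` are hypothetical methods standing in
    for a critique-generating reward model; the paper's actual interface may differ.
    """
    scored = []
    for response in responses:
        critique = model.generate_critique(prompt, response)  # natural-language feedback
        reward = model.score(prompt, response, critique)      # scalar reward given the critique
        scored.append((reward, response, critique))
    # Highest reward first; the attached critique explains each score.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```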
🔎 Similar Papers
2024-01-18 · International Conference on Machine Learning · Citations: 264
👥 Authors
Yue Yu · GenAI, Meta; Georgia Institute of Technology
Zhengxing Chen · Northeastern University · Game Analytics, Machine Learning, Data Mining
Aston Zhang · OpenAI · Machine Learning, Large Language Models
Liang Tan · GenAI, Meta
Chenguang Zhu · GenAI, Meta
Richard Yuanzhe Pang · Meta, New York University · Natural Language Processing, Machine Learning
Yundi Qian · GenAI, Meta
Xuewei Wang · GenAI, Meta
Suchin Gururangan · GenAI, Meta
Chao Zhang · Georgia Institute of Technology
M. Kambadur · GenAI, Meta
Dhruv Mahajan · GenAI, Meta
Rui Hou · Member of Technical Staff, xAI · Large Language Model, Reasoning