Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of defining reward signals for real-world tasks, where balancing objective and subjective evaluation criteria remains difficult, this paper proposes structured, checklist-style rubrics as interpretable and decomposable reward signals for reinforcement learning-based alignment of language models. It introduces explicit rubrics into the RLHF framework for the first time, replacing opaque reward models to mitigate spurious correlations and enhance interpretability. Building on GRPO, the authors design an end-to-end training pipeline that enables compact judge models to accurately capture human preferences. On HealthBench-1k, the approach achieves up to a 28% relative improvement over Likert-scale-based methods, matching or surpassing the performance of expert-reference-based rewards while remaining robust across diverse model scales.

📝 Abstract
Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce **Rubrics as Rewards** (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a 28% relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
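The core idea of checklist-style rubric rewards can be sketched in a few lines: each rubric item is a weighted yes/no criterion, and the reward is the weighted fraction of criteria the response satisfies. This is a minimal illustrative sketch only; the names are hypothetical, and the substring-matching "judge" is a toy stand-in for the LLM judge the paper actually uses.

```python
# Toy sketch of a rubric-as-reward function (not the paper's pipeline).
# Each checklist item carries a weight; a real system would ask an LLM
# judge whether the criterion is satisfied, not match substrings.
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str     # checklist-style criterion text
    weight: float      # relative importance of this criterion
    satisfied_if: str  # toy proxy: substring the response must contain


def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    """Aggregate per-item checks into a scalar reward in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(
        item.weight
        for item in rubric
        if item.satisfied_if.lower() in response.lower()
    )
    return earned / total


# Hypothetical health-advice rubric in the spirit of HealthBench items.
rubric = [
    RubricItem("Advises consulting a clinician", 2.0, "doctor"),
    RubricItem("Mentions hydration", 1.0, "fluids"),
    RubricItem("Avoids a definitive diagnosis", 1.0, "may"),
]

response = (
    "Your symptoms may improve with rest and fluids; "
    "see a doctor if fever persists."
)
print(rubric_reward(response, rubric))  # 1.0: all weighted criteria met
```

A scalar reward of this form can then drive on-policy optimization (e.g. GRPO) directly, which is what makes the decomposed checklist more interpretable than a single opaque preference score: each point of reward traces back to a named criterion.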
Problem

Research questions and friction points this paper is trying to address.

Balancing objective and subjective evaluation criteria in RLVR
Lack of reliable reward signals for post-training language models
Opaque reward functions in traditional preference-based methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses structured rubrics as interpretable rewards
Improves alignment with human preferences
Enhances performance across model scales
Anisha Gunjal
Scale AI
Anthony Wang
Scale AI
Elaine Lau
McGill University, Mila, Scale AI
deep learning, reinforcement learning, natural language processing
Vaskar Nath
Scale AI
Bing Liu
Scale AI
Sean Hendryx
Scale AI