🤖 AI Summary
To address the challenge of defining reward signals in real-world tasks—where balancing objective and subjective evaluation criteria remains difficult—this paper proposes using structured, checklist-style rubrics as interpretable and decomposable reward signals for reinforcement learning–based alignment of language models. It introduces explicit rubrics into the RLHF framework for the first time, replacing opaque reward models to mitigate spurious correlations and enhance interpretability. Building upon GRPO, the authors design an end-to-end training pipeline enabling compact judge models to accurately capture human preferences. On HealthBench-1k, the approach achieves up to a 28% improvement over Likert-scale–based methods, matching or surpassing the performance of expert-reference–based rewards, while demonstrating robustness across diverse model scales.
📝 Abstract
Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce $\textbf{Rubrics as Rewards}$ (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a $28\%$ relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
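To make the core idea concrete, here is a minimal sketch of how a checklist-style rubric could be aggregated into a scalar reward. This is an illustrative assumption, not the paper's implementation: the criteria, weights, and the `rubric_reward` helper are hypothetical, and in practice each criterion would be scored by a judge model rather than hard-coded.

```python
def rubric_reward(item_scores, weights):
    """Aggregate per-criterion scores (0 or 1) into a normalized scalar reward."""
    total = sum(w * s for w, s in zip(weights, item_scores))
    return total / sum(weights)

# Hypothetical checklist for a health question: (criterion, weight, satisfied?)
checklist = [
    ("mentions red-flag symptoms",          2.0, 1),
    ("recommends consulting a clinician",   1.0, 1),
    ("avoids giving a definitive diagnosis", 1.0, 0),
]
weights = [w for _, w, _ in checklist]
scores = [s for _, _, s in checklist]

print(round(rubric_reward(scores, weights), 2))  # 0.75
```

Because each criterion is scored independently, the reward is decomposable: one can inspect exactly which checklist items a response failed, which is the interpretability advantage over a single opaque preference score.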