🤖 AI Summary
To address the challenge of defining reward signals in real-world tasks—where balancing objective and subjective evaluation criteria remains difficult—this paper proposes using structured, checklist-style rubrics as interpretable and decomposable reward signals for reinforcement learning–based alignment of language models. It introduces explicit rubrics into the RLHF framework for the first time, replacing opaque reward models to mitigate spurious correlations and enhance interpretability. Building upon GRPO, the authors design an end-to-end training pipeline enabling compact judge models to accurately capture human preferences. On HealthBench-1k, the approach achieves up to a 28% improvement over Likert-scale–based methods, matching or surpassing the performance of expert-reference–based rewards, while demonstrating robustness across diverse model scales.
📝 Abstract
Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce $\textbf{Rubrics as Rewards}$ (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a $28\%$ relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
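To make the core idea concrete, here is a minimal sketch of how a checklist-style rubric could be aggregated into a scalar reward. This is an illustrative assumption, not the paper's implementation: the criteria, weights, and the `rubric_reward` helper are hypothetical, and in practice each criterion would be scored by a judge model rather than hard-coded.

```python
def rubric_reward(item_scores, weights):
    """Aggregate per-criterion scores (0 or 1) into a normalized scalar reward."""
    total = sum(w * s for w, s in zip(weights, item_scores))
    return total / sum(weights)

# Hypothetical checklist for a health question: (criterion, weight, satisfied?)
checklist = [
    ("mentions red-flag symptoms",          2.0, 1),
    ("recommends consulting a clinician",   1.0, 1),
    ("avoids giving a definitive diagnosis", 1.0, 0),
]
weights = [w for _, w, _ in checklist]
scores = [s for _, _, s in checklist]

print(round(rubric_reward(scores, weights), 2))  # 0.75
```

Because each criterion is scored independently, the reward is decomposable: one can inspect exactly which checklist items a response failed, which is the interpretability advantage over a single opaque preference score.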