R3: Robust Rubric-Agnostic Reward Models

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reward models suffer from three key limitations: poor controllability, weak generalization across tasks and domains, and lack of interpretability. These problems stem from optimizing for narrow objectives and emitting scalar scores that are hard to interpret without contextual reasoning. This work introduces R3, a rubric-agnostic reward modeling framework that generalizes across evaluation dimensions and produces interpretable, reasoned score assignments: each evaluation pairs a numerical score with an explicit natural-language rationale rather than an opaque scalar. By decoupling the reward model from any fixed, task-specific rubric, R3 supports transparent and flexible evaluation of language model outputs, enabling robust alignment with diverse human values and use cases. All models, data, and code are publicly released.

📝 Abstract
Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3
Problem

Research questions and friction points this paper is trying to address.

Lack of controllability and interpretability in reward models
Limited generalizability to broader downstream tasks
Difficulty in interpreting scalar outputs without context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-agnostic reward modeling framework
Generalizable across evaluation dimensions
Interpretable reasoned score assignments
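The paper does not specify a concrete output format, but an "interpretable, reasoned score assignment" from a generative reward model can be pictured as a free-text rationale followed by a final score line. A minimal sketch, assuming a hypothetical `Rationale ... Score: N` format (the function name and format are illustrative, not from the paper):

```python
import re

def parse_reward_output(text):
    """Split a generative reward model's response into its rationale
    and final numeric score (hypothetical output format)."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", text)
    if match is None:
        raise ValueError("no score found in model output")
    score = float(match.group(1))
    # Everything before the score line is the model's reasoning chain.
    rationale = text[:match.start()].strip()
    return rationale, score

example = (
    "The response follows the instruction and cites the task context, "
    "but omits one requested detail.\n"
    "Score: 4"
)
rationale, score = parse_reward_output(example)
print(score)  # prints 4.0
```

The point of such a format is that the score stays machine-readable for alignment pipelines while the rationale remains available for human inspection.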