GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward modeling has long suffered from heavy reliance on large-scale human preference annotations and from a lack of explicit reasoning. To address these limitations, we propose GRAM-R$^2$, a generative foundation reward model that introduces self-training into reward modeling, enabling the model to acquire reward reasoning from unlabeled data. GRAM-R$^2$ unifies preference judgment and rationale generation within a single generative framework and transfers to new tasks with minimal or no additional fine-tuning. Empirically, GRAM-R$^2$ consistently outperforms strong discriminative and generative baselines on response ranking, task adaptation, and reinforcement learning from human feedback (RLHF), with markedly improved generalization and practical utility. By integrating self-supervised reasoning, unified generative modeling, and annotation-efficient learning, GRAM-R$^2$ points toward interpretable, annotation-light, and highly adaptable general-purpose reward models.
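As a concrete illustration of the judge-and-explain interface described above, the sketch below queries a generative reward model for a rationale followed by a preference label. The checkpoint name, prompt template, and 'Preferred: A/B' markers are assumptions for illustration only; the paper does not specify them here.

```python
# Minimal sketch: ask a generative reward model to reason first, then
# commit to a preference verdict. The checkpoint name and prompt format
# are hypothetical placeholders, not the released GRAM-R^2 artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-org/gram-r2"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def judge(prompt: str, response_a: str, response_b: str) -> str:
    query = (
        f"Instruction: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Compare the two responses. Explain your reasoning step by step, "
        "then end with 'Preferred: A' or 'Preferred: B'."
    )
    inputs = tokenizer(query, return_tensors="pt")
    # do_sample=True so repeated calls can yield different rationales,
    # which the self-training filter sketched later relies on.
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```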

📝 Abstract
Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models still faces a fundamental challenge: heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
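The abstract names response ranking as one downstream application. A minimal sketch of how a pairwise generative judge could produce a ranking follows; the round-robin scheme and the judge() helper from the sketch above are illustrative assumptions, not the paper's prescribed procedure.

```python
# Sketch: rank candidates by round-robin pairwise wins, using the
# hypothetical judge() helper defined above (returns text ending in
# 'Preferred: A' or 'Preferred: B').
def rank_responses(prompt: str, candidates: list[str]) -> list[str]:
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            verdict = judge(prompt, a, b)
            winner = a if verdict.rstrip().endswith("A") else b
            wins[winner] += 1
    # Highest win count first; ties keep input order (sorted is stable).
    return sorted(candidates, key=wins.get, reverse=True)
```

For n candidates this costs n(n-1)/2 judge calls; a single-elimination tournament or best-of-k scheme reduces that when n is large.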
Problem

Research questions and friction points this paper is trying to address.

Developing reward models without large labeled datasets
Instilling explicit reasoning into reward models
Creating general-purpose reward models requiring minimal fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training generative reward model with rationales
Leverages unlabeled data to elicit reward reasoning (see the sketch after this list)
Foundation model applicable to a wide range of tasks with minimal or no fine-tuning
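The self-training bullet above follows a general pseudo-labeling pattern: label unlabeled response pairs with the current model, keep the confident ones, and fine-tune on them. The self-consistency filter and the fine_tune() step below are assumptions standing in for whatever confidence criterion and trainer the paper actually uses.

```python
# Sketch of a generic self-training loop for reward reasoning.
# judge() is the hypothetical helper sketched earlier; fine_tune() is a
# hypothetical placeholder for supervised fine-tuning on pseudo-labels.
def self_train(model, unlabeled_pairs, rounds=3, samples=4):
    for _ in range(rounds):
        pseudo_data = []
        for prompt, resp_a, resp_b in unlabeled_pairs:
            # Sample several rationales; keep the pair only if all
            # verdicts agree (a simple self-consistency filter).
            verdicts = [judge(prompt, resp_a, resp_b) for _ in range(samples)]
            labels = {v.rsplit("Preferred:", 1)[-1].strip() for v in verdicts}
            if len(labels) == 1:
                pseudo_data.append((prompt, resp_a, resp_b, verdicts[0]))
        model = fine_tune(model, pseudo_data)  # hypothetical trainer
    return model
```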
Authors

Chenglong Wang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yongyu Mu, Northeastern University (multilingualism, machine translation, efficient models)
Hang Zhou, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yifu Huo, Northeastern University
Ziming Zhu, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jiali Zeng, Tencent (natural language processing, deep learning, neural machine translation)
Murun Yang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Bei Li, Meituan LLM Team (machine translation, deep learning, large language models)
Tong Xiao, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaoyang Hao, Tencent (speech synthesis)
Chunliang Zhang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Fandong Meng, WeChat AI, Tencent (machine translation, natural language processing)
Jingbo Zhu, Northeastern University, China (machine translation, language parsing, natural language processing)