GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward modeling has long suffered from heavy reliance on large-scale human preference annotations and from a lack of explicit reasoning. To address these limitations, we propose GRAM-R$^2$, a generative foundation reward model that introduces self-training into reward modeling, enabling the model to acquire reward reasoning from unlabeled data. GRAM-R$^2$ unifies preference judgment and rationale generation within a single generative framework and transfers to new tasks with minimal or no additional fine-tuning. Empirically, GRAM-R$^2$ consistently outperforms strong discriminative and generative baselines on response ranking, task adaptation, and reinforcement learning from human feedback (RLHF), with markedly improved generalization and practical utility. By integrating self-supervised reasoning, unified generative modeling, and annotation-efficient learning, GRAM-R$^2$ points toward interpretable, annotation-light, and highly adaptable general-purpose reward models.
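As a concrete illustration of the judge-and-explain interface described above, the sketch below queries a generative reward model for a rationale followed by a preference label. The checkpoint name, prompt template, and 'Preferred: A/B' markers are assumptions for illustration only; the paper does not specify them here.

```python
# Minimal sketch: ask a generative reward model to reason first, then
# commit to a preference verdict. The checkpoint name and prompt format
# are hypothetical placeholders, not the released GRAM-R^2 artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-org/gram-r2"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def judge(prompt: str, response_a: str, response_b: str) -> str:
    query = (
        f"Instruction: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Compare the two responses. Explain your reasoning step by step, "
        "then end with 'Preferred: A' or 'Preferred: B'."
    )
    inputs = tokenizer(query, return_tensors="pt")
    # do_sample=True so repeated calls can yield different rationales,
    # which the self-training filter sketched later relies on.
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```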

📝 Abstract
Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models still faces a fundamental challenge: heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
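The abstract names response ranking as one downstream application. A minimal sketch of how a pairwise generative judge could produce a ranking follows; the round-robin scheme and the judge() helper from the sketch above are illustrative assumptions, not the paper's prescribed procedure.

```python
# Sketch: rank candidates by round-robin pairwise wins, using the
# hypothetical judge() helper defined above (returns text ending in
# 'Preferred: A' or 'Preferred: B').
def rank_responses(prompt: str, candidates: list[str]) -> list[str]:
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            verdict = judge(prompt, a, b)
            winner = a if verdict.rstrip().endswith("A") else b
            wins[winner] += 1
    # Highest win count first; ties keep input order (sorted is stable).
    return sorted(candidates, key=wins.get, reverse=True)
```

For n candidates this costs n(n-1)/2 judge calls; a single-elimination tournament or best-of-k scheme reduces that when n is large.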
Problem

Research questions and friction points this paper is trying to address.

Developing reward models without large labeled datasets
Instilling explicit reasoning into reward models
Creating general-purpose reward models requiring minimal fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training generative reward model with rationales
Leverages unlabeled data to elicit reward reasoning (see the sketch after this list)
Foundation model applicable to a wide range of tasks with minimal or no fine-tuning
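The self-training bullet above follows a general pseudo-labeling pattern: label unlabeled response pairs with the current model, keep the confident ones, and fine-tune on them. The self-consistency filter and the fine_tune() step below are assumptions standing in for whatever confidence criterion and trainer the paper actually uses.

```python
# Sketch of a generic self-training loop for reward reasoning.
# judge() is the hypothetical helper sketched earlier; fine_tune() is a
# hypothetical placeholder for supervised fine-tuning on pseudo-labels.
def self_train(model, unlabeled_pairs, rounds=3, samples=4):
    for _ in range(rounds):
        pseudo_data = []
        for prompt, resp_a, resp_b in unlabeled_pairs:
            # Sample several rationales; keep the pair only if all
            # verdicts agree (a simple self-consistency filter).
            verdicts = [judge(prompt, resp_a, resp_b) for _ in range(samples)]
            labels = {v.rsplit("Preferred:", 1)[-1].strip() for v in verdicts}
            if len(labels) == 1:
                pseudo_data.append((prompt, resp_a, resp_b, verdicts[0]))
        model = fine_tune(model, pseudo_data)  # hypothetical trainer
    return model
```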
Authors

Chenglong Wang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yongyu Mu, Northeastern University (multilingualism, machine translation, efficient models)
Hang Zhou, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yifu Huo, Northeastern University
Ziming Zhu, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jiali Zeng, Tencent (natural language processing, deep learning, neural machine translation)
Murun Yang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Bei Li, Meituan LLM Team (machine translation, deep learning, large language models)
Tong Xiao, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaoyang Hao, Tencent (speech synthesis)
Chunliang Zhang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Fandong Meng, WeChat AI, Tencent (machine translation, natural language processing)
Jingbo Zhu, Northeastern University, China (machine translation, language parsing, natural language processing)