IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models

šŸ“… 2026-01-02
šŸ›ļø arXiv.org
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ“„ PDF
šŸ¤– AI Summary
This work addresses the O(n²) computational bottleneck that pairwise comparisons impose on generative reward models (GRMs) in reinforcement learning. The authors propose Intergroup Relative Preference Modeling (IRPM), which integrates the Bradley-Terry model into the GRPO framework by replacing pairwise comparisons with pointwise scoring. This approach preserves fine-grained reward signals and interpretability while substantially improving evaluation efficiency across multiple candidates. Experimental results show that IRPM achieves state-of-the-art performance among pointwise GRMs on multiple benchmarks, approaches the effectiveness of leading pairwise GRMs, and delivers substantial gains in post-training evaluations.

šŸ“ Abstract
Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling due to their interpretability and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF): calibrating or aggregating preference signals over n candidates often incurs O(n^2) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RL-based method that extends the Bradley-Terry preference-learning paradigm via intergroup comparisons to train pointwise GRMs from pairwise preference data. IRPM derives a pointwise reward for each response by contrasting groups of chosen vs. rejected samples, enabling pointwise scores that are comparable across candidate sets and O(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench, and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM achieves substantial gains in post-training evaluations, demonstrating its effectiveness.
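The abstract's core idea — deriving a pointwise score by contrasting a response against a group of chosen vs. rejected samples under a Bradley-Terry objective — can be sketched as follows. This is a minimal illustration of an intergroup Bradley-Terry loss, not the paper's exact objective; the function name and the choice of contrasting each chosen score against the mean rejected score are assumptions for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, the Bradley-Terry win probability."""
    return 1.0 / (1.0 + math.exp(-x))

def intergroup_bt_loss(chosen_scores: list[float],
                       rejected_scores: list[float]) -> float:
    """Illustrative intergroup Bradley-Terry loss (a sketch, not the
    paper's exact formulation): each pointwise score of a chosen
    response is contrasted against the mean score of the rejected
    group. Scoring n responses thus needs only n pointwise passes,
    rather than O(n^2) pairwise judgments."""
    mu_rejected = sum(rejected_scores) / len(rejected_scores)
    # Negative log-likelihood that each chosen response beats the
    # rejected group: -log sigmoid(s_chosen - mu_rejected).
    losses = [-math.log(sigmoid(s - mu_rejected)) for s in chosen_scores]
    return sum(losses) / len(losses)
```

With a large margin between the groups the loss approaches zero, and with no margin it equals log 2, as in the standard pairwise Bradley-Terry loss; the intergroup variant simply replaces the single rejected opponent with a group statistic so that scores remain comparable across candidate sets.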
Problem

Research questions and friction points this paper is trying to address.

Generative Reward Models
Bradley-Terry model
Reinforcement Learning
computational bottleneck
pairwise comparisons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intergroup Relative Preference Modeling
Bradley-Terry model
Generative Reward Models
pointwise scoring
reinforcement learning
Haonan Song
HUJING Digital Media & Entertainment Group (XingYun Lab), Beijing, China
Qingchen Xie
Department of Automation, Tsinghua University, Beijing, China
Huan Zhu
HUJING Digital Media & Entertainment Group (XingYun Lab), Beijing, China
Feng Xiao
Beijing FengYun Vision Technology Co. Ltd. (Imaging, 3D Vision, AI)
Luxi Xing
Institute of Information Engineering, Chinese Academy of Sciences (Natural Language Processing)
Fuzhen Li
HUJING Digital Media & Entertainment Group (XingYun Lab), Beijing, China
Liu Kang
HUJING Digital Media & Entertainment Group (XingYun Lab), Beijing, China
Feng Jiang
HUJING Digital Media & Entertainment Group (XingYun Lab), Beijing, China
Zhiyong Zheng
HUJING Digital Media & Entertainment Group (XingYun Lab), Beijing, China
Fan Yang
Tsinghua University (Mathematics, Probability, Statistics)