AI Summary
This work addresses the O(n²) computational bottleneck that pairwise comparisons in generative reward models (GRMs) create during reinforcement learning. The authors propose Intergroup Relative Preference Modeling (IRPM), which extends the Bradley-Terry preference-learning paradigm with intergroup comparisons, replacing exhaustive pairwise judgments with pointwise scoring. This approach preserves fine-grained reward signals and interpretability while reducing the cost of evaluating n candidates from O(n²) to O(n). Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on multiple benchmarks, approaches the effectiveness of leading pairwise GRMs, and delivers substantial gains in post-training evaluations.
Abstract
Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling, owing to their interpretability and their potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF): calibrating or aggregating preference signals over n candidates often incurs O(n²) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RL-based method that extends the Bradley-Terry preference-learning paradigm via intergroup comparisons to train pointwise GRMs from pairwise preference data. IRPM derives a pointwise reward for each response by contrasting groups of chosen vs. rejected samples, yielding pointwise scores that are comparable across candidate sets and enabling O(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench, and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM achieves substantial gains in post-training evaluations, demonstrating its effectiveness.
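The intergroup contrast described above can be sketched as a Bradley-Terry-style objective over group-level scores. The snippet below is a minimal illustration, not the paper's exact formulation: it assumes a pointwise scorer assigns a scalar to each response, aggregates the chosen and rejected groups by their mean score (the aggregation choice is an assumption), and minimizes the negative log-likelihood that the chosen group beats the rejected group. The comment also notes why pointwise scoring costs O(n) calls versus O(n²) for exhaustive pairwise judging.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, the Bradley-Terry win probability for a score gap."""
    return 1.0 / (1.0 + math.exp(-x))


def intergroup_bt_loss(chosen_scores, rejected_scores):
    """Negative log-likelihood that the chosen group beats the rejected group.

    Group-mean aggregation is an illustrative assumption; IRPM's exact
    intergroup contrast may aggregate or weight samples differently.
    """
    mu_chosen = sum(chosen_scores) / len(chosen_scores)
    mu_rejected = sum(rejected_scores) / len(rejected_scores)
    return -math.log(sigmoid(mu_chosen - mu_rejected))


# Pointwise scoring needs one reward-model call per candidate: O(n)
# evaluations for n candidates, versus O(n^2) judgments if every pair
# of candidates must be compared by a pairwise GRM.
loss = intergroup_bt_loss([2.0, 1.5], [0.5, -0.2])
```

As the chosen-group mean pulls ahead of the rejected-group mean, the loss falls toward zero; identical group means give the indifference loss of ln 2.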