CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the suboptimal performance of large language models (LLMs) when used as reward models for evaluating open-ended text generation. To overcome this limitation, the authors propose CE-RM-4B, a generative reward model based on pointwise evaluation that departs from the prevailing pairwise comparison paradigm. The approach introduces a unified query-aware evaluation criterion and employs an efficient two-stage rollout sampling strategy during training. Trained on only 5.7K high-quality preference data points, CE-RM-4B achieves state-of-the-art performance across multiple reward modeling benchmarks, significantly outperforming existing methods in Best-of-N evaluation scenarios and effectively enhancing downstream reinforcement learning tasks.

📝 Abstract
Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigm enables better and more flexible evaluation, and shows promise for generative reward models in reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and their actual effectiveness in RL practice. We attribute this issue to limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. We therefore propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method that adopts unified query-based criteria. Using only about 5.7K high-quality examples curated from an open-source preference dataset, CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
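The pointwise setup the abstract contrasts with pairwise comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_fn` stands in for a generative reward model such as CE-RM-4B that maps a (query, response) pair to a scalar, so Best-of-N selection needs only N scoring calls, whereas a pairwise judge would need O(N²) comparisons or a tournament. All names here are hypothetical.

```python
# Hypothetical sketch of Best-of-N selection with a pointwise reward model.
# score_fn(query, response) -> float is assumed to wrap a generative judge;
# here a toy scorer is used purely for illustration.

def best_of_n(query, candidates, score_fn):
    """Return the candidate with the highest pointwise score for the query."""
    scored = [(score_fn(query, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

def toy_score(query, response):
    # Stand-in scorer: favors longer responses. A real reward model would
    # instead produce a query-aware quality judgment.
    return len(response)

best = best_of_n("example query",
                 ["short", "a much longer candidate response", "mid size"],
                 toy_score)
print(best)  # → "a much longer candidate response"
```

Because each candidate gets an independent score, the same scores can also be reused directly as scalar rewards for downstream RL, which pairwise preferences cannot provide without extra aggregation.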
Problem

Research questions and friction points this paper is trying to address.

reward model
LLM-as-a-Judge
natural language generation
reinforcement learning
evaluation criteria
Innovation

Methods, ideas, or system contributions that make the work stand out.

pointwise reward model
two-stage rollout
unified evaluation criteria
LLM-as-a-Judge
reinforcement learning