🤖 AI Summary
This work addresses the unreliability of absolute reward signals in subjective, non-verifiable domains, where conventional pointwise reward modeling struggles to provide consistent supervision. The authors propose RUBRIC-ARROW, an alternating training framework that jointly optimizes a rubric (scoring criteria) generator and a judge conditioned on the generated rubrics, with the RL stage using only pairwise preference data. The approach uses a probability-based scoring rule to reduce the ties that hard Boolean aggregation causes, and combines phase-specific preference-based rewards with an alternating GRPO scheme to train the pointwise evaluator. Experiments show that the method achieves competitive reward-modeling accuracy and yields consistent gains in downstream policy post-training.
📝 Abstract
Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method combines three components to train the pointwise evaluator: a probability-based scoring rule that reduces ties, phase-specific preference-based rewards, and an alternating GRPO scheme. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
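The tie problem mentioned in the abstract can be illustrated with a small sketch. This is not the paper's scoring rule, which the abstract does not specify. It assumes a judge that returns, for each rubric criterion, a probability that the response satisfies it, and it compares thresholded (Boolean) aggregation with a plain average of those probabilities. The function names, the 0.5 threshold, the example probabilities, and the 0.5 reward on ties are all hypothetical.

```python
from typing import Sequence


def boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Hard aggregation: a criterion counts as satisfied (1) or not (0).

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return sum(1 for p in probs if p >= threshold) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Soft aggregation: average the per-criterion probabilities directly."""
    return sum(probs) / len(probs)


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """Hypothetical pairwise reward: 1.0 if the preferred response scores
    strictly higher, 0.5 on a tie, 0.0 otherwise."""
    if score_chosen > score_rejected:
        return 1.0
    if score_chosen == score_rejected:
        return 0.5
    return 0.0


# Made-up per-criterion probabilities for one preference pair.
chosen = [0.9, 0.8, 0.3]
rejected = [0.6, 0.55, 0.1]

# Boolean aggregation: both responses pass 2 of 3 criteria, so they tie.
hard_reward = preference_reward(boolean_score(chosen), boolean_score(rejected))

# Probability aggregation keeps the margin between the two responses.
soft_reward = preference_reward(
    probability_score(chosen), probability_score(rejected)
)

print(hard_reward, soft_reward)
```

In this toy pair, thresholding discards the fact that the preferred response satisfies each criterion more confidently, so the two pointwise scores coincide and the pair carries no ranking signal. Averaging the probabilities preserves that difference.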