RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the unreliability of absolute reward signals in subjective, non-verifiable domains, where conventional pointwise reward modeling struggles to provide consistent supervision. The authors propose RUBRIC-ARROW, an alternating training framework that jointly optimizes a rubric generator and a rubric-conditioned judge, with the RL stage using only pairwise preference data. The approach combines a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme to train the pointwise evaluator. Experiments show that the method achieves competitive reward-modeling accuracy and yields consistent gains in downstream policy post-training.
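
To make the idea of training a pointwise scorer from pairwise preferences concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's reward definition: the function names, the 0/1 reward, and the sample scores are all hypothetical. The only standard piece is the group-relative normalization that GRPO applies to rewards within a group of sampled rollouts.

```python
from statistics import mean, pstdev


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """Hypothetical reward: 1.0 if the pointwise scores rank the preferred
    response strictly higher, else 0.0 (a tie earns nothing)."""
    return 1.0 if score_chosen > score_rejected else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed data: four sampled judge rollouts for one preference pair, each
# producing a (score for chosen response, score for rejected response) tuple.
rollouts = [(0.8, 0.3), (0.5, 0.5), (0.4, 0.6), (0.9, 0.1)]
rewards = [preference_reward(c, r) for c, r in rollouts]
advantages = group_relative_advantages(rewards)

for pair, reward, adv in zip(rollouts, rewards, advantages):
    print(pair, reward, round(adv, 3))
```

In this sketch, rollouts whose scores agree with the human preference receive a positive advantage and the rest a negative one, so no absolute score labels are needed.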

📝 Abstract

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
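
The tie problem mentioned in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that the judge exposes a probability of "criterion satisfied" for each rubric item and that scores are aggregated by a plain average. The paper's actual scoring rule is not specified in this summary, so the functions and numbers below are illustrative only.

```python
def boolean_score(verdicts: list[bool]) -> float:
    """Hard aggregation: fraction of rubric criteria judged satisfied."""
    return sum(verdicts) / len(verdicts)


def probability_score(p_yes: list[float]) -> float:
    """Soft aggregation (assumed form): mean probability that each
    criterion is satisfied."""
    return sum(p_yes) / len(p_yes)


# Assumed per-criterion probabilities for two responses to the same prompt.
p_a = [0.9, 0.7, 0.6]
p_b = [0.6, 0.55, 0.8]

# Thresholding at 0.5 gives both responses the verdicts [True, True, True].
hard_a = boolean_score([p >= 0.5 for p in p_a])
hard_b = boolean_score([p >= 0.5 for p in p_b])

# Averaging the probabilities keeps the difference in judge confidence.
soft_a = probability_score(p_a)
soft_b = probability_score(p_b)

print("hard:", hard_a, hard_b)
print("soft:", round(soft_a, 3), round(soft_b, 3))
```

Under hard Boolean aggregation the two responses tie, so the pair carries no preference signal. The probability-based scores differ, which is the kind of tie reduction the abstract describes.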

Problem

Research questions and friction points this paper is trying to address.

reward modeling

non-verifiable domains

rubric-based evaluation

scoring ties

LLM post-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-based reward modeling

alternating training

pairwise preference