Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of evaluating response quality in non-verifiable domains—such as creative writing—where traditional reward models struggle to capture the multidimensional nature of output quality. To this end, the authors propose Rubric-ARM, a framework that integrates dynamic rubric generation with preference-based feedback, treating scoring rubrics as latent actions. The approach jointly optimizes a rubric generator and a judge through reinforcement learning, employing an alternating training mechanism that reduces gradient variance and improves judgment accuracy. Experimental results show that Rubric-ARM achieves state-of-the-art performance across multiple benchmarks and significantly improves alignment between downstream policies and human preferences in both offline and online reinforcement learning settings.

📝 Abstract
Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, with theoretical analysis showing how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
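The core mechanism described above—freezing one module while the other updates, so that each phase optimizes against a stationary partner—can be illustrated with a minimal toy sketch. This is not the paper's algorithm: the real system trains a rubric generator and a judge with reinforcement learning from preference labels, whereas the sketch below substitutes two scalar "parameters" and plain gradient descent on a hand-made coupled objective. All names (`rubric_param`, `judge_param`, `PHASE_LEN`, the loss shape) are illustrative assumptions.

```python
# Toy illustration of an alternating optimization schedule (NOT the
# paper's actual RL training loop). Two coupled modules are stood in
# for by two scalars; each phase updates only one of them, holding the
# other fixed, mimicking the alternating rubric-generator/judge updates.

# Assumed toy parameters: rubric generator and judge, each one scalar.
rubric_param = 0.0
judge_param = 0.0

def loss(r, j):
    # Coupled toy objective: the judge's best setting depends on the
    # rubric, so simultaneous updates would chase a moving target.
    return (r - 1.5) ** 2 + (j + 0.5 - 0.1 * r) ** 2

def grad_rubric(r, j):
    # d(loss)/dr, including the coupling term.
    return 2 * (r - 1.5) - 0.2 * (j + 0.5 - 0.1 * r)

def grad_judge(r, j):
    # d(loss)/dj.
    return 2 * (j + 0.5 - 0.1 * r)

LR = 0.1        # step size
PHASE_LEN = 10  # steps per phase before switching modules

for step in range(100):
    # Alternating schedule: even phases update the rubric generator,
    # odd phases update the judge. The frozen module keeps the other
    # module's objective stationary within a phase.
    if (step // PHASE_LEN) % 2 == 0:
        rubric_param -= LR * grad_rubric(rubric_param, judge_param)
    else:
        judge_param -= LR * grad_judge(rubric_param, judge_param)

print(f"final loss: {loss(rubric_param, judge_param):.6f}")
```

Running the loop drives the joint loss close to zero: within each phase the active module descends a fixed objective, and the phases together approximate coordinate descent on the coupled problem. This stationarity-within-a-phase property is the intuition behind the variance-reduction claim, though the paper's formal analysis concerns policy-gradient variance, not this deterministic surrogate.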
Problem

Research questions and friction points this paper is trying to address.

reward modeling
non-verifiable domains
response quality
rubric-based evaluation
LLM post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-based reward modeling
alternating reinforcement learning
non-verifiable domains
preference feedback
gradient variance reduction