Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks

📅 2026-02-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing approaches to generating evaluation rubrics often suffer from incomplete coverage, dimension conflation, misaligned preferences, and redundant or highly correlated criteria, which degrade the judgment accuracy and reward quality of large language models (LLMs). To address these limitations, this work proposes the RRD framework, a recursive rubric-refinement mechanism: coarse-grained criteria are decomposed into fine-grained, discriminative ones, while relevance-aware filtering and correlation-aware weighting eliminate redundant or misaligned components to achieve multi-dimensional preference alignment. The method substantially improves LLM judgment accuracy on JudgeBench and PPE, by up to 17.7 points, and, when used as a reward source in reinforcement fine-tuning (RFT), boosts reward scores by 160% for Qwen3-4B and 60% for Llama3.1-8B on WildChat. It further generalizes well to challenging benchmarks such as HealthBench-Hard and BiGGen Bench.
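The recursive decompose-filter cycle described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `decompose_fn` and `is_discriminative` stand in for the LLM calls the paper would use, and all names here are hypothetical.

```python
def refine_rubrics(rubrics, decompose_fn, is_discriminative, max_depth=3):
    """Recursively split coarse rubric criteria; keep only leaves that
    already discriminate between responses (or that hit the depth cap)."""
    refined = []
    for rubric in rubrics:
        if is_discriminative(rubric) or max_depth == 0:
            refined.append(rubric)
        else:
            # Coarse criterion: expand into finer sub-criteria and recurse.
            children = decompose_fn(rubric)
            refined.extend(refine_rubrics(
                children, decompose_fn, is_discriminative, max_depth - 1))
    return refined
```

With a toy decomposer that splits `"quality"` into `"quality.a"` and `"quality.b"` and treats two levels of specificity as discriminative, the call returns the four second-level criteria.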

๐Ÿ“ Abstract
Recently, rubrics have been used to guide LLM judges in capturing subjective, nuanced, multi-dimensional human preferences, and have been extended from evaluation to reward signals for reinforcement fine-tuning (RFT). However, rubric generation remains hard to control: rubrics often lack coverage, conflate dimensions, misalign preference direction, and contain redundant or highly correlated criteria, degrading judge accuracy and producing suboptimal rewards during RFT. We propose RRD, a principled framework for rubric refinement built on a recursive decompose-filter cycle. RRD decomposes coarse rubrics into fine-grained, discriminative criteria, expanding coverage while sharpening separation between responses. A complementary filtering mechanism removes misaligned and redundant rubrics, and a correlation-aware weighting scheme prevents over-representing highly correlated criteria, yielding rubric sets that are informative, comprehensive, and non-redundant. Empirically, RRD delivers large, consistent gains across both evaluation and training: it improves preference-judgment accuracy on JudgeBench and PPE for both GPT-4o and Llama3.1-405B judges, achieving top performance in all settings with up to +17.7 points on JudgeBench. When used as the reward source for RFT on WildChat, it yields substantially stronger and more stable learning signals, boosting reward by up to 160% (Qwen3-4B) and 60% (Llama3.1-8B) versus 10-20% for prior rubric baselines, with gains that transfer to HealthBench-Hard and BiGGen Bench. Overall, RRD establishes recursive rubric refinement as a scalable and interpretable foundation for LLM judging and reward modeling in open-ended domains.
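As a concrete illustration of the correlation-aware weighting idea in the abstract, here is a minimal plain-Python sketch: each criterion scores every response, and a criterion's weight is shrunk in proportion to its average absolute Pearson correlation with the remaining criteria, so that near-duplicate criteria are not over-counted. The shrinkage rule `1 / (1 + avg_corr)` is an assumption for illustration, not the paper's exact scheme.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_aware_weights(score_matrix):
    """score_matrix[i] holds criterion i's scores over the responses.
    A criterion correlated with many others gets a smaller weight;
    the weights are normalized to sum to 1."""
    k = len(score_matrix)
    weights = []
    for i in range(k):
        corrs = [abs(pearson(score_matrix[i], score_matrix[j]))
                 for j in range(k) if j != i]
        avg = sum(corrs) / len(corrs) if corrs else 0.0
        weights.append(1.0 / (1.0 + avg))
    total = sum(weights)
    return [w / total for w in weights]
```

For example, given two duplicate criteria and one weakly correlated criterion, the duplicates each receive a smaller share of the total weight than the independent one.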
Problem

Research questions and friction points this paper is trying to address.

rubric generation
LLM judge
reward modeling
open-ended tasks
preference alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric refinement
recursive decomposition
LLM judging
reward modeling
reinforcement fine-tuning