Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning for complex reasoning tasks often relies on sparse or monolithic reward signals, which limits generalization. This work proposes a rubric-grounded reinforcement learning framework in which a frozen large language model (LLM) judge assigns fine-grained, partial-credit rewards according to weighted, verifiable, multi-criterion rubrics. The judge conditions on auxiliary document grounding that the policy itself never sees. The framework thus combines structured scoring rubrics with external grounding in a single multi-criterion reward mechanism. Trained with GRPO, Llama-3.1-8B-Instruct achieves a normalized reward of 71.7% on held-out rubric evaluation and improves over the base model on four reasoning benchmarks not derived from the training corpus: GSM8K, MATH, GPQA Main, and GPQA Diamond.
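As a rough illustration of the weighted, partial-credit reward described above, the sketch below aggregates per-criterion judge scores into one normalized value. The `Criterion` class, the criterion names, the weights, and the weighted-average aggregation are all hypothetical choices made for this example; the summary does not state the paper's exact scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    name: str
    weight: float  # relative importance of this criterion


def normalized_rubric_reward(criteria, scores):
    """Weighted average of per-criterion judge scores.

    Assumes each score is a partial-credit value in [0, 1], so the
    result is also in [0, 1].
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = 0.0
    for c in criteria:
        score = scores[c.name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {c.name!r} must lie in [0, 1]")
        earned += c.weight * score
    return earned / total_weight


# Illustrative rubric and judge output (invented for this example).
rubric = [
    Criterion("states_governing_equation", 2.0),
    Criterion("uses_correct_units", 1.0),
    Criterion("reaches_correct_conclusion", 1.0),
]
judge_scores = {
    "states_governing_equation": 1.0,   # full credit
    "uses_correct_units": 0.5,          # partial credit
    "reaches_correct_conclusion": 0.0,  # no credit
}
reward = normalized_rubric_reward(rubric, judge_scores)
print(reward)
```

A response that satisfies only some criteria still receives a nonzero reward here, which is the property that distinguishes this signal from a binary outcome reward.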

📝 Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from a corpus of roughly 100,000 scientific and technical documents drawn from the Office of Scientific and Technical Information (OSTI) and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus: GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
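To make the GRPO step more concrete, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several responses are sampled for the same prompt, and each response's reward is standardized against the group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the sample rewards are assumptions for illustration; the abstract does not give the training configuration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for
    numerical stability; implementations differ on these details.
    """
    if not rewards:
        raise ValueError("need at least one reward per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical normalized rubric rewards for four sampled responses.
group_rewards = [0.25, 0.5, 0.75, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Because a rubric reward takes many values between 0 and 1, responses in a group rarely tie, so the standardized advantages can rank them even when no response is fully correct. With a binary reward, a group where every response fails yields all-zero advantages and no learning signal.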

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

reasoning generalization

structured reward

LLM judge

rubric evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-grounded reinforcement learning

structured reward

LLM judge