🤖 AI Summary
To address the weak discriminative capability and susceptibility to positional bias of LLM-as-judge models in reasoning-intensive tasks, this paper proposes Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO), a novel reinforcement learning algorithm for training judges. The authors further introduce ReasoningJudgeBench, a benchmark dedicated to evaluating judges across diverse reasoning settings not covered by prior work. Methodologically, the approach combines equivalent-initial-state modeling, which makes training robust to response-order effects, with the GRPO training paradigm applied to reasoning-focused judgment tasks. The resulting 7B judge, Judge for Reasoning (J4R), trained under this framework, achieves a 6.7% absolute improvement over GPT-4o and outperforms the next best small judge by 9%, while matching or exceeding significantly larger GRPO-trained judges on JudgeBench and ReasoningJudgeBench. This advancement substantially enhances the robustness and accuracy of automated evaluation for complex reasoning outputs.
📝 Abstract
To keep pace with the rapid development of large language models (LLMs), model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel at evaluating relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9% respectively, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.
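To make the positional-bias idea concrete, here is a minimal sketch of how a GRPO-style group-relative advantage could be pooled across "equivalent initial states" (e.g., the same pairwise-judgment prompt with the two candidate responses shown in either order). This is an illustrative reading, not the paper's actual implementation; the function name, reward values, and the pooling-by-order scheme are all assumptions for the example.

```python
import statistics

def eis_grpo_advantages(rewards_by_state):
    """Compute group-relative advantages, pooling rollouts from all
    equivalent initial states into a single group (illustrative sketch).

    rewards_by_state maps each equivalent presentation of the same
    underlying judgment (e.g. response order A-B vs. B-A) to the
    rewards of the rollouts sampled from that presentation.
    """
    # Pool rewards across equivalent states so all rollouts share one
    # baseline: a judge that only answers correctly in one response
    # order is penalized relative to that shared baseline.
    pooled = [r for rewards in rewards_by_state.values() for r in rewards]
    mean = statistics.mean(pooled)
    std = statistics.pstdev(pooled) or 1.0  # guard against zero variance
    return {
        state: [(r - mean) / std for r in rewards]
        for state, rewards in rewards_by_state.items()
    }

# Hypothetical rollout rewards for one prompt shown in two equivalent orders.
rewards = {"order_AB": [1.0, 0.0, 1.0], "order_BA": [0.0, 0.0, 1.0]}
adv = eis_grpo_advantages(rewards)
# Pooled mean is 0.5, so correct judgments get +1.0 and incorrect get -1.0
# regardless of which response order they were sampled under.
```

Contrast with vanilla GRPO, where each presentation would be normalized within its own group: pooling the equivalent states into one group is what lets the training signal discourage order-dependent verdicts.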