REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation in existing LLM-as-a-Judge approaches for reinforcement learning (RL), which rely solely on binary rewards and discard the ordinal information inherent in human or model-generated scores. While regression-aware methods have shown promise, they are largely confined to supervised fine-tuning (SFT) and struggle to effectively explore reasoning trajectories. To bridge this gap, we propose REAL, the first framework that integrates regression objectives directly into RL. REAL employs a generalized policy gradient estimator to jointly optimize chain-of-thought trajectory exploration and regression accuracy, with theoretical guarantees of optimality for correlation-based metrics. Experiments across 8B–32B models demonstrate that REAL substantially outperforms both SFT and standard RL baselines. On Qwen3-32B, it achieves Pearson/Spearman correlation improvements of +8.40/+7.20 over SFT and +18.30/+11.20 over the base model, while also exhibiting strong out-of-domain generalization.
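The Pearson/Spearman gains cited above measure how well the judge's predicted scores track ground-truth scores. As a quick illustration of what those two metrics capture, here is a minimal pure-Python sketch (the score lists are made-up toy data, not the paper's results; ties are not handled in the rank computation):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation = Pearson computed on ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

# Hypothetical human ground-truth scores vs. judge predictions (1-5 scale)
human = [5, 3, 1, 4, 2]
judge = [4, 3, 2, 5, 1]
print(round(pearson(human, judge), 3), round(spearman(human, judge), 3))
# → 0.8 0.8
```

Pearson rewards linear agreement on the raw scores, while Spearman only cares about the ranking, which is why the paper reports both.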

📝 Abstract
Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (REgression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards that is also provably optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, which invalidates standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectories, and (2) regression-aware refinement of the final score prediction. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.
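The abstract's 4-vs-1 example can be made concrete with a toy reward comparison. The sketch below is an illustration only, not REAL's actual objective: the binary reward makes every miss look identical to the policy, while a distance-based reward (here, an assumed linear decay in absolute error over a 1-5 score scale) preserves the ordinal information.

```python
def binary_reward(pred: int, truth: int) -> float:
    # Standard 0-1 accuracy reward: all misses are scored identically,
    # so the policy gets no signal that a near miss is better.
    return 1.0 if pred == truth else 0.0

def regression_reward(pred: int, truth: int, lo: int = 1, hi: int = 5) -> float:
    # Illustrative regression-aware shaping (assumed, not the paper's exact
    # formula): reward decays linearly with absolute error on the score scale.
    return 1.0 - abs(pred - truth) / (hi - lo)

truth = 5
for pred in (4, 1):
    print(pred, binary_reward(pred, truth), regression_reward(pred, truth))
# → 4 0.0 0.75
#   1 0.0 0.0
```

Under the binary reward both predictions earn 0.0, whereas the regression-aware reward gives the near miss (4) substantially more credit, which is the ordinal signal REAL's RL objective is built to exploit.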
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
Reinforcement Learning
Regression-aware
Policy Gradient
Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regression-Aware Reinforcement Learning
LLM-as-a-Judge
Generalized Policy Gradient
Chain-of-Thought Exploration
Ordinal Reward Optimization