S2J: Bridging the Gap Between Solving and Judging Ability in Generative Reward Models

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative Reward Models (GRMs) suffer from a “solve-to-judge gap”: strong problem-solving capability paired with weak judgment accuracy. This paper is the first to formally define the phenomenon and proposes Solve-to-Judge (S2J), an end-to-end, self-evolving training framework that jointly supervises solution generation and judgment outputs within a single GRM, eliminating reliance on stronger external models for distillation. By co-training both tasks on preference data, S2J explicitly aligns the model's solving and judging capabilities. Experiments show that, on the same base model, S2J reduces the solve-to-judge gap by 16.2% and thereby improves judgment accuracy by 5.8%, achieving state-of-the-art performance while using less training data than prior approaches.

📝 Abstract
With the rapid development of large language models (LLMs), generative reward models (GRMs) have been widely adopted for reward modeling and evaluation. Previous studies have primarily focused on training specialized GRMs by optimizing them on preference datasets with judgment correctness as supervision. While it is widely accepted that GRMs with stronger problem-solving capabilities typically exhibit superior judgment abilities, we first identify a significant solve-to-judge gap when examining individual queries. Specifically, the solve-to-judge gap refers to the phenomenon where GRMs struggle to make correct judgments on some queries (14%–37%) despite being fully capable of solving them. In this paper, we propose the Solve-to-Judge (S2J) approach to address this problem. Specifically, S2J simultaneously leverages both the solving and judging capabilities on a single GRM's output for supervision, explicitly linking the GRM's problem-solving and evaluation abilities during model optimization, thereby narrowing the gap. Our comprehensive experiments demonstrate that S2J effectively reduces the solve-to-judge gap by 16.2%, thereby enhancing the model's judgment performance by 5.8%. Notably, S2J achieves state-of-the-art (SOTA) performance among GRMs built on the same base model while utilizing a significantly smaller training dataset. Moreover, S2J accomplishes this through self-evolution without relying on more powerful external models for distillation.
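The abstract describes joint supervision over a single GRM's solving and judging outputs. As a minimal illustrative sketch (the function names, matching rules, and the weighting scheme below are assumptions for illustration, not the paper's actual training objective), the core idea can be expressed as a combined reward that credits both a correct self-generated solution and a correct preference judgment:

```python
# Hypothetical sketch of S2J-style joint supervision. All names and the
# linear weighting are illustrative assumptions, not the paper's method.

def solve_reward(solution: str, reference_answer: str) -> float:
    """Reward the GRM's own solution: 1.0 if it matches the reference."""
    return 1.0 if solution.strip() == reference_answer.strip() else 0.0

def judge_reward(judgment: str, preferred: str) -> float:
    """Reward the GRM's judgment: 1.0 if it picks the preferred response."""
    return 1.0 if judgment == preferred else 0.0

def s2j_reward(solution: str, judgment: str,
               reference_answer: str, preferred: str,
               alpha: float = 0.5) -> float:
    """Tie solving and judging together in one supervision signal,
    so a model that solves a query correctly but judges it wrongly
    (the solve-to-judge gap) receives only partial reward."""
    return (alpha * solve_reward(solution, reference_answer)
            + (1 - alpha) * judge_reward(judgment, preferred))

# Model solves correctly ("42") but judges wrongly (picks "B", not "A"):
partial = s2j_reward("42", "B", "42", "A")   # partial credit only
full = s2j_reward("42", "A", "42", "A")      # both outputs correct
```

A single scalar like this lets one optimization loop supervise both capabilities of the same model, which is the self-evolving aspect: no stronger external model is needed to label either output.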
Problem

Research questions and friction points this paper is trying to address.

GRMs exhibit a gap between problem-solving and judging ability
Models fail to judge 14%–37% of queries correctly despite being able to solve them
How to explicitly link solving and judging capabilities during optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervises both the solving and judging outputs of a single GRM
Explicitly links problem-solving and evaluation abilities during optimization
Narrows the solve-to-judge gap through self-evolution, without distillation from stronger models
Authors
Shaoning Sun (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Jiachen Yu (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Zongqi Wang (Tsinghua University)
Xuewei Yang (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Tianle Gu (Tsinghua University)
Yujiu Yang (SIGS, Tsinghua University)