🤖 AI Summary
In open-domain question answering, answer correctness is difficult to verify, and existing faithfulness-based reinforcement learning approaches often cause models to over-rely on external documents, undermining their capacity for critical evaluation. To address this, we propose a correctness-centered reward modeling paradigm that, for the first time, introduces chain-of-thought reasoning into generative reward models. Our approach establishes a three-tier evaluation framework: (1) sentence-level faithfulness judgment, (2) explicit modeling of the reasoning process, and (3) final correctness assessment. By incorporating fine-grained sentence-level supervision and explicitly modeling intermediate reasoning steps, the reward model acquires critical reasoning abilities, significantly improving its precision in detecting incorrect answers. Experiments show that our method substantially enhances reward signal quality, leading to more effective policy optimization across multiple open-domain QA benchmarks, with significant improvements in both answer correctness and practical utility.
📝 Abstract
Through reinforcement learning with verifiable rewards (RLVR), large language models have achieved substantial progress in domains with easily verifiable outcomes, such as mathematics and coding. However, when applied to more complex tasks like open-domain question answering, RLVR faces significant challenges because the nuanced and ambiguous nature of real-world knowledge makes correctness hard to verify reliably; evaluation therefore requires abilities beyond mere logical consistency, namely understanding and assessing both external and internal knowledge. Recent work has primarily focused on improving faithfulness, defined as semantic alignment with supporting documents, which can cause models to rely excessively on external sources and diminish their capacity for critical assessment. To address this, we propose the Thinking-supervised Reward Model (TRM), which incorporates sentence-level thinking supervision to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, then applies a reasoning step to evaluate sentence-level correctness. By structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and leverage both external and internal knowledge. Experiments on reward signals demonstrate that TRM substantially improves the identification of incorrect sentences, and incorporating TRM into policy optimization yields significant gains in both answer correctness and usefulness.
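The faithfulness → reasoning → correctness pipeline described above can be illustrated with a minimal Python sketch. This is not the paper's model: TRM is a learned generative reward model, whereas here a crude lexical-overlap heuristic stands in for the faithfulness judgment, and the function and field names (`evaluate_answer`, `SentenceEval`, `reward`) are illustrative assumptions, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class SentenceEval:
    sentence: str
    faithful: bool   # Stage 1: is the sentence supported by the documents?
    reasoning: str   # Stage 2: explicit rationale for the judgment
    correct: bool    # Stage 3: final sentence-level correctness

def evaluate_answer(query: str, answer_sentences: list[str],
                    documents: list[str]) -> list[SentenceEval]:
    """Toy stand-in for TRM: score each answer sentence in three stages."""
    doc_tokens = {w.lower() for d in documents for w in d.split()}
    evals = []
    for sent in answer_sentences:
        tokens = {w.lower() for w in sent.split()}
        overlap = len(tokens & doc_tokens) / max(len(tokens), 1)
        faithful = overlap >= 0.5  # crude lexical proxy for faithfulness
        reasoning = (f"{overlap:.0%} of tokens appear in the documents; "
                     + ("treating as supported."
                        if faithful
                        else "unsupported content, would fall back to internal knowledge."))
        # Stage 3: a learned model would weigh external against internal
        # knowledge; this sketch simply equates correctness with faithfulness.
        correct = faithful
        evals.append(SentenceEval(sent, faithful, reasoning, correct))
    return evals

def reward(evals: list[SentenceEval]) -> float:
    """Scalar reward for policy optimization: fraction of correct sentences."""
    return sum(e.correct for e in evals) / max(len(evals), 1)
```

For example, scoring a two-sentence answer against a single supporting document marks the supported sentence correct and the unsupported one incorrect, giving a reward of 0.5; the learned TRM replaces the heuristic with a trained chain-of-thought judgment at each stage.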