🤖 AI Summary
Large language models (LLMs) suffer from extrinsic hallucinations: factually incorrect outputs unsupported by their training data. Mitigating these hallucinations poses a fundamental trade-off, since existing approaches tend to sacrifice open-ended generation capability for factual accuracy.
Method: This paper proposes an online reinforcement learning framework with a binary retrieval-augmented reward (RAR), in which a reward of one is granted only when the model's output is fully verified against retrieved evidence, and zero otherwise. This all-or-nothing signal avoids the performance degradation associated with continuous rewards and inherently encourages abstention on queries outside the model's knowledge. Built on Qwen3 reasoning models, the method uses retrieval augmentation to construct fine-grained, factuality-oriented reward signals.
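The all-or-nothing reward described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the claim-level `verify` function is a hypothetical stand-in (here a toy substring check) for the retrieval-backed verifier, and the continuous variant is included only to contrast the two reward schemes.

```python
def binary_rar(claims, evidence, verify):
    """Binary retrieval-augmented reward: 1.0 only if every claim in
    the output is verified against the retrieved evidence, else 0.0."""
    return 1.0 if claims and all(verify(c, evidence) for c in claims) else 0.0

def continuous_rar(claims, evidence, verify):
    """Continuous baseline for contrast: fraction of verified claims."""
    if not claims:
        return 0.0
    return sum(verify(c, evidence) for c in claims) / len(claims)

# Toy verifier: a claim counts as supported if it appears verbatim
# in the evidence. A real system would use an NLI or LLM-based judge.
verify = lambda claim, evidence: claim in evidence

evidence = "Paris is the capital of France. The Seine flows through Paris."
claims_ok = ["Paris is the capital of France"]
claims_mixed = ["Paris is the capital of France",
                "Berlin is the capital of France"]

print(binary_rar(claims_ok, evidence, verify))        # 1.0
print(binary_rar(claims_mixed, evidence, verify))     # 0.0: one error zeroes the reward
print(continuous_rar(claims_mixed, evidence, verify)) # 0.5: partial credit
```

Under the binary scheme a single unverified claim forfeits the entire reward, so the policy is pushed toward outputs it can fully support, or toward abstaining, rather than toward padding answers with partially correct claims as the continuous scheme permits.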
Results: Experiments demonstrate a 39.3% reduction in hallucination rate on open-ended generation; incorrect-answer rates on PopQA and GPQA drop by 44.4% and 21.7%, respectively; and no performance degradation is observed on downstream tasks (instruction following, mathematical reasoning, and code generation), indicating the factuality gains come without sacrificing utility.
📝 Abstract
Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.