🤖 AI Summary
Large language models (LLMs) suffer from extrinsic hallucinations: factually incorrect outputs unsupported by their training data. Mitigating these hallucinations poses a fundamental trade-off, since existing approaches tend to sacrifice open-ended generation capability for factual accuracy.
Method: This paper proposes an online reinforcement learning framework with a binary retrieval-augmented reward (RAR), in which a reward of one is granted only when the model's output is fully verified against retrieved evidence, and zero otherwise. This all-or-nothing signal avoids the performance degradation associated with continuous rewards and inherently encourages abstention on queries outside the model's knowledge. Built on Qwen3 reasoning models, the method uses retrieval augmentation to construct fine-grained, factuality-oriented reward signals.
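The all-or-nothing reward described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the claim-level `verify` function is a hypothetical stand-in (here a toy substring check) for the retrieval-backed verifier, and the continuous variant is included only to contrast the two reward schemes.

```python
def binary_rar(claims, evidence, verify):
    """Binary retrieval-augmented reward: 1.0 only if every claim in
    the output is verified against the retrieved evidence, else 0.0."""
    return 1.0 if claims and all(verify(c, evidence) for c in claims) else 0.0

def continuous_rar(claims, evidence, verify):
    """Continuous baseline for contrast: fraction of verified claims."""
    if not claims:
        return 0.0
    return sum(verify(c, evidence) for c in claims) / len(claims)

# Toy verifier: a claim counts as supported if it appears verbatim
# in the evidence. A real system would use an NLI or LLM-based judge.
verify = lambda claim, evidence: claim in evidence

evidence = "Paris is the capital of France. The Seine flows through Paris."
claims_ok = ["Paris is the capital of France"]
claims_mixed = ["Paris is the capital of France",
                "Berlin is the capital of France"]

print(binary_rar(claims_ok, evidence, verify))        # 1.0
print(binary_rar(claims_mixed, evidence, verify))     # 0.0: one error zeroes the reward
print(continuous_rar(claims_mixed, evidence, verify)) # 0.5: partial credit
```

Under the binary scheme a single unverified claim forfeits the entire reward, so the policy is pushed toward outputs it can fully support, or toward abstaining, rather than toward padding answers with partially correct claims as the continuous scheme permits.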
Results: Experiments demonstrate a 39.3% reduction in hallucination rate on open-ended generation; incorrect-answer rates on PopQA and GPQA drop by 44.4% and 21.7%, respectively; and no performance degradation is observed on downstream tasks (instruction following, mathematical reasoning, and code generation), indicating the factuality gains come without sacrificing utility.
📝 Abstract
Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.