Learning to Reason for Factuality

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) hallucinate heavily in long-form generation and lack reliable factual verification, which makes online reinforcement learning (RL) difficult to apply in this setting. To address this, the authors propose an online RL framework explicitly optimized for factual accuracy. The method introduces a reward function that jointly models factual precision, level of detail, and answer relevance, mitigating the reward hacking that arises when conventional automated factuality metrics are used directly as rewards. Preference signals are built on automatic factuality evaluators such as FActScore, and the model is trained against the combined multi-objective reward. Evaluated on six long-form factuality benchmarks, the approach reduces the average hallucination rate by 23.1 percentage points and increases answer detail by 23%, while preserving overall response helpfulness. The core contribution is the integration of a multidimensional, hack-resistant reward with online RL, substantially improving the factual robustness of reasoning LLMs in long-form generation.

📝 Abstract
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
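The abstract's key idea, combining factual precision, detail level, and relevance into a single reward so that no one axis can be gamed in isolation, can be sketched as follows. The function below is an illustrative assumption, not the authors' implementation; the weighting-by-multiplication scheme, the detail cap, and all names are hypothetical.

```python
# Hypothetical sketch of a multi-component factuality reward in the spirit
# of the paper. Shapes and names are assumptions, not the authors' code.

def factual_reward(num_supported: int, num_claims: int,
                   detail_target: int, relevance: float) -> float:
    """Combine factual precision, detail level, and relevance into one scalar.

    num_supported: claims in the response verified as factually supported
    num_claims:    total claims extracted from the response
    detail_target: assumed typical claim count for a suitably detailed answer
    relevance:     score in [0, 1] from a relevance judge
    """
    # FActScore-style precision: fraction of claims that are supported.
    precision = num_supported / num_claims if num_claims else 0.0
    # Reward more claims up to the target, so the model cannot hack the
    # precision term by emitting very short, trivially true answers.
    detail = min(num_claims / detail_target, 1.0)
    # Multiplying by relevance blocks the hack of emitting safe but
    # off-topic text: any factor near zero drives the whole reward to zero.
    return precision * detail * relevance
```

Multiplying the three factors (rather than summing them) means the policy gets no credit for maximizing one axis while collapsing another, which is the failure mode the abstract describes when FActScore alone is used as the online reward.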
Problem

Research questions and friction points this paper is trying to address.

R-LLMs hallucinate substantially more than non-reasoning models on long-form factuality tasks
Online RL lacks reliable factuality verification in the long-form setting
Using automatic factuality metrics directly as online rewards invites reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel reward jointly scoring factual precision, detail level, and relevance
Online RL trained against this reward learns high-quality factual reasoning
Cuts hallucination rate by 23.1 percentage points without hurting helpfulness