TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) face a trade-off in knowledge-intensive question answering: optimizing for accuracy tends to amplify hallucination, while encouraging abstention can make a model overly conservative and cost it correct answers. This work proposes TruthRL, a reinforcement learning framework that optimizes truthfulness directly, where truthfulness means answering correctly when the model knows the answer and abstaining when it does not. TruthRL is implemented with GRPO and a ternary reward that distinguishes correct answers, hallucinations, and abstentions. Across four knowledge-intensive benchmarks, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1% compared to vanilla RL. The gains are consistent across backbone models (e.g., Qwen, Llama) in both retrieval and non-retrieval setups.

📝 Abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy: models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
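
The difference between a ternary reward and an accuracy-only binary reward can be illustrated with a minimal Python sketch. The abstract does not give the numeric reward values, so the values used here (+1 for correct, 0 for abstention, -1 for hallucination) are an assumption, as are the `Outcome` enum, the function names, and the form of the binary baseline. How a response is judged correct, abstained, or hallucinated is outside the sketch.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for the three response types named in the abstract.
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATED = "hallucinated"


def ternary_reward(outcome: Outcome) -> float:
    """Ternary reward; the numeric values are assumed, not taken from the source."""
    values = {
        Outcome.CORRECT: 1.0,        # correct answer
        Outcome.ABSTAIN: 0.0,        # abstention: neither rewarded nor penalized
        Outcome.HALLUCINATED: -1.0,  # confident but wrong answer
    }
    return values[outcome]


def binary_reward(outcome: Outcome) -> float:
    """Accuracy-only baseline (assumed form): anything not correct scores the same."""
    return 1.0 if outcome is Outcome.CORRECT else 0.0


for outcome in Outcome:
    print(outcome.value, ternary_reward(outcome), binary_reward(outcome))
```

Under the binary reward, abstaining and hallucinating are indistinguishable, so the policy has no incentive to prefer "I don't know" over a guess. The ternary reward separates the two cases, which is the property the abstract attributes to TruthRL.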

Problem

Research questions and friction points this paper is trying to address.

Optimizing truthfulness in LLMs using reinforcement learning

Reducing hallucinations while enabling appropriate abstention

Balancing factual accuracy with uncertainty recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning framework optimizes LLM truthfulness

Ternary reward distinguishes correct, hallucinated, and abstained responses

Balances accuracy and uncertainty through truthfulness-driven objective
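
To show how a ternary reward could interact with GRPO, the sketch below computes group-relative advantages, the normalization GRPO applies to the rewards of several responses sampled for the same prompt. The reward values (+1, 0, -1), the function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details confirmed by the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    `eps` avoids division by zero when every response in the group
    receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses (assumed reward values):
# one correct, one abstention, two hallucinations.
ternary_group = [1.0, 0.0, -1.0, -1.0]
ternary_adv = group_relative_advantages(ternary_group)

# The same responses scored by an accuracy-only binary reward.
binary_group = [1.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_group)

print("ternary:", ternary_adv)
print("binary: ", binary_adv)
```

In this example the abstention receives a higher advantage than the hallucinations under the ternary reward, while under the binary reward the two are pushed down equally. That gap is what would let the policy learn to abstain rather than guess when a correct answer is out of reach.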

🔎 Similar Papers

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories