The Hallucination Tax of Reinforcement Finetuning

📅 2025-05-20
🤖 AI Summary
This work identifies a critical trade-off in Reinforcement Fine-Tuning (RFT): while RFT enhances the reasoning capabilities of large language models (LLMs), it severely degrades their ability to abstain from answering unanswerable questions. The authors call this the “hallucination tax”: models produce confident, hallucinated responses to queries that cannot be answered. To study it, they define and quantify the hallucination tax, introduce SUM (Synthetic Unanswerable Math), a high-quality synthetic dataset of unanswerable mathematical problems, and propose a data augmentation method based on it. Mixing just 10% SUM data into RFT training restores abstention behavior and lets models use inference-time compute to reason about their own uncertainty. Experiments show that standard RFT reduces abstention rates by over 80%; integrating SUM recovers and even exceeds baseline abstention performance and generalizes to out-of-domain mathematical and factual QA tasks, while preserving near-original accuracy on the primary reasoning tasks.

📝 Abstract
Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior that causes models to confidently produce hallucinated answers to unanswerable questions. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models' ability to recognize an unanswerable question by reasoning from insufficient or ambiguous information. Our results show that standard RFT training can reduce model refusal rates by more than 80%, which significantly increases the model's tendency to hallucinate. We further demonstrate that incorporating just 10% SUM during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.
Problem

Research questions and friction points this paper is trying to address.

Investigates degradation in refusal behavior from reinforcement finetuning
Studies hallucination tax causing confident wrong answers
Proposes dataset to improve model recognition of unanswerable questions
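
To make the "refusal rate" measured here concrete, the following is a minimal sketch of how it could be computed over a set of unanswerable questions. It is not the paper's evaluation code: the marker phrase "I don't know", the substring-matching rule, and the function names are all assumptions chosen for illustration. The `relative_drop` helper shows one possible reading of a reduction in refusal rate (a relative one); this summary does not state whether the reported figure is relative or absolute.

```python
from typing import Sequence

# Assumption: a refusal is detected by a fixed marker phrase in the response.
# The actual abstention format used in the paper may differ.
REFUSAL_MARKERS = ("i don't know",)


def is_refusal(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any refusal marker."""
    # Normalize case and curly apostrophes before matching.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses to unanswerable questions that are refusals."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


def relative_drop(before: float, after: float) -> float:
    """Relative reduction in refusal rate, e.g. 0.5 -> 0.1 is a drop of 0.8."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before
```

A lower refusal rate on unanswerable questions means more of them received a concrete (and therefore hallucinated) answer, which is the effect the hallucination tax describes.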
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SUM dataset for unanswerable math problems
Incorporates 10% SUM during RFT to reduce hallucination
Enhances LLMs' reasoning on uncertainty and knowledge boundaries
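
The 10% mixing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes that unanswerable SUM examples replace a fraction of the solvable examples so that the training set size stays fixed (the summary does not say whether they replace or are added), and the function name, parameter names, and dictionary fields are hypothetical.

```python
import random
from typing import Dict, List


def mix_training_set(
    solvable: List[Dict],
    unanswerable: List[Dict],
    unanswerable_fraction: float = 0.1,
    seed: int = 0,
) -> List[Dict]:
    """Build a training set in which a fixed fraction is unanswerable.

    Assumption: the total size equals len(solvable), and unanswerable
    examples take the place of some solvable ones.
    """
    if not 0.0 <= unanswerable_fraction < 1.0:
        raise ValueError("unanswerable_fraction must be in [0, 1)")

    rng = random.Random(seed)
    total = len(solvable)
    n_unanswerable = round(total * unanswerable_fraction)
    if n_unanswerable > len(unanswerable):
        raise ValueError("not enough unanswerable examples for this fraction")

    # Sample without replacement from each pool, then shuffle the union.
    mixed = rng.sample(solvable, total - n_unanswerable)
    mixed += rng.sample(unanswerable, n_unanswerable)
    rng.shuffle(mixed)
    return mixed
```

During RFT, an unanswerable example would presumably be rewarded when the model abstains rather than when it produces a numeric answer; the reward design itself is not described in this summary and is left out of the sketch.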