🤖 AI Summary
To address factual hallucinations in large language models (LLMs), which arise when training objectives prioritize fitting the data distribution over truthfulness, this paper proposes a behavior-calibration-based reinforcement learning framework for safety-critical applications. Its core innovation is the use of strictly proper scoring rules to decouple uncertainty quantification from predictive accuracy, enabling the model to actively abstain from answering or to attach fine-grained uncertainty annotations according to its confidence, thereby strengthening its metacognitive capability. Evaluated on Qwen3-4B-Instruct, the method achieves strong hallucination suppression on BeyondAIME (mathematical reasoning), with a log-scale Accuracy-to-Hallucination Ratio gain of 0.806 versus GPT-5's 0.207; it also matches Grok-4 and Gemini-2.5-Pro in zero-shot calibration error on SimpleQA despite much lower factual accuracy. Notably, this work demonstrates a small model surpassing frontier models in uncertainty quantification, a transferable meta-skill.
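To make the decoupling concrete, here is a minimal sketch (not the paper's actual reward; the function names and values are illustrative assumptions) of why a strictly proper scoring rule such as the negative Brier score incentivizes reporting a calibrated probability of correctness rather than always guessing with full confidence:

```python
def brier_reward(confidence: float, correct: bool) -> float:
    """Negative Brier score: a strictly proper scoring rule, so expected
    reward is maximized only by reporting the true correctness probability."""
    outcome = 1.0 if correct else 0.0
    return -(confidence - outcome) ** 2

def expected_reward(confidence: float, p_correct: float) -> float:
    """Expected reward when the model's true chance of being right is p_correct."""
    return (p_correct * brier_reward(confidence, True)
            + (1.0 - p_correct) * brier_reward(confidence, False))

# If the model is only 30% likely to be right, honestly reporting 0.3
# yields a higher expected reward than claiming certainty:
assert expected_reward(0.3, p_correct=0.3) > expected_reward(1.0, p_correct=0.3)
```

Under a binary right/wrong reward, by contrast, guessing is optimal whenever the correctness probability exceeds zero, which is exactly the test-taker incentive the abstract describes.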
📝 Abstract
LLM deployment in critical domains is currently impeded by persistent hallucinations: plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives that prioritize mimicking the data distribution over epistemic honesty. Standard RLVR paradigms, which use binary reward signals, inadvertently incentivize models to behave as good test-takers rather than honest communicators, encouraging guessing whenever the probability of correctness exceeds zero. This paper presents a thorough investigation into behavioral calibration, which incentivizes models to admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions that optimize strictly proper scoring rules so that models output a calibrated probability of correctness. Our methods enable models either to abstain from producing a complete response or to flag individual claims where uncertainty remains. Empirical analysis with Qwen3-4B-Instruct reveals that behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification, a transferable meta-skill that can be decoupled from raw predictive accuracy. After training on math reasoning tasks alone, our model attains a log-scale Accuracy-to-Hallucination Ratio gain of 0.806, exceeding GPT-5's 0.207 on a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.
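The SimpleQA comparison rests on calibration error, which can be low even when accuracy is low. As a rough sketch (the paper's exact metric and binning scheme are not specified here; equal-width-bin expected calibration error is an assumption), the quantity being compared looks like:

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """Equal-width-bin ECE: per-bin |accuracy - mean confidence|,
    weighted by the fraction of samples falling in each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A model that is right only half the time but says so is perfectly calibrated:
assert expected_calibration_error([0.5, 0.5], [1, 0]) == 0.0
```

This is why a 4B model can match frontier models on calibration error while trailing them in factual accuracy: the metric rewards knowing what you don't know, not knowing more.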