Rewarding Intellectual Humility: Learning When Not to Answer in Large Language Models

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large language models to hallucination in fact-sensitive scenarios, which undermines their reliability. To promote intellectual humility, the authors propose a Reinforcement Learning with Verifiable Rewards (RLVR) framework that combines a ternary reward mechanism with supervised fine-tuning, incentivizing models to abstain from answering, by responding "I don't know", when uncertain. Experiments on the MedMCQA and Hendrycks Math benchmarks with Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct show that a moderate abstention reward (r_abs ≈ −0.25 to 0.3) substantially reduces multiple-choice errors, with the larger model exhibiting greater robustness to abstention incentives. Supervised fine-tuning further mitigates insufficient exploration in open-domain question answering, suppressing erroneous responses while preserving high accuracy.

📝 Abstract
Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (−1, r_abs, 1) under varying abstention reward values. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs ≈ −0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.
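The ternary verifiable reward described in the abstract can be sketched as a simple scoring function. This is an illustrative sketch, not the authors' implementation: the function name, the exact abstention phrase matched, and the answer-comparison logic are assumptions; only the reward values (−1, r_abs, 1) come from the paper.

```python
def ternary_reward(response: str, gold: str, r_abs: float = 0.3) -> float:
    """Verifiable ternary reward for abstention training (illustrative sketch).

    Returns 1.0 for a correct answer, r_abs for an explicit abstention,
    and -1.0 for an incorrect answer. The paper studies r_abs in roughly
    the range -0.25 to 0.3.
    """
    normalized = response.strip().lower()
    # Abstention phrase is an assumption; the paper's verifier may match differently.
    if normalized == "i don't know":
        return r_abs
    # Exact-match correctness check, as in multiple-choice settings (e.g. MedMCQA).
    return 1.0 if response.strip() == gold.strip() else -1.0
```

Setting r_abs between the penalty for a wrong answer (−1) and the reward for a correct one (1) makes abstention strictly preferable to guessing whenever the model's chance of being correct is low enough, which is the mechanism the paper exploits.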
Problem

Research questions and friction points this paper is trying to address.

hallucination
intellectual humility
abstention
large language models
reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning with Verifiable Rewards
intellectual humility
abstention training
hallucination mitigation
ternary reward structure
Abha Jha, University of Southern California (Computer Vision, Generative AI, Large Language Models)
Akanksha Mahajan, University of Southern California
Ashwath Vaithinathan Aravindan, University of Southern California (Deep Learning, Interpretability)
Praveen Saravanan, University of Southern California
Sai Sailaja Policharla, University of Southern California
Sonal Chaturbhuj Gehlot, University of Southern California