Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question Answering

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the safety and accuracy risks posed by ambiguous user queries in medical question answering with large language models (LLMs). The authors first reveal that aleatoric uncertainty (AU)—arising from inherent input ambiguity—is linearly encoded in the internal activations of LLMs. Building on this insight, they propose AU-Probe, a lightweight module that detects query ambiguity without requiring model fine-tuning or multiple inference passes. Leveraging AU-Probe, they introduce a “clarify-then-answer” framework and establish CV-MedBench, a new benchmark for evaluating ambiguity handling in medical QA. Experiments across four open-source LLMs demonstrate an average accuracy improvement of 9.48%, significantly enhancing the safety and reliability of medical responses.
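The summary's central claim is that aleatoric uncertainty is *linearly* encoded in hidden activations, which is exactly the condition under which a simple linear probe (the idea behind AU-Probe) can detect ambiguity from a single forward pass. The sketch below illustrates that idea on synthetic data with a planted linear "ambiguity" direction; it is a minimal illustration, not the paper's implementation, and all names (`au_direction`, dimensions, hyperparameters) are assumptions for the toy setup.

```python
import numpy as np

# Toy demonstration: if ambiguity is linearly encoded in hidden states,
# a logistic-regression probe recovers it from activations alone.
rng = np.random.default_rng(0)
D, n = 64, 400                          # hidden-state dim and sample count (toy)

au_direction = rng.normal(size=D)       # planted linear "ambiguity" direction
au_direction /= np.linalg.norm(au_direction)

X = rng.normal(size=(n, D))             # simulated hidden activations
labels = (X @ au_direction > 0).astype(float)  # 1 = ambiguous query

# Fit a linear probe (w, b) by gradient descent on the logistic loss.
w, b, lr = np.zeros(D), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid scores
    w -= lr * (X.T @ (p - labels)) / n
    b -= lr * np.mean(p - labels)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = np.mean(pred == labels)
print(f"probe accuracy: {accuracy:.2f}")  # high when the encoding is linear
```

Because the signal is a single linear direction, the probe separates the two classes almost perfectly; with a nonlinearly encoded signal the same probe would fail, which is why the linearity finding matters.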

📝 Abstract
The deployment of Large Language Models in Medical Question Answering is severely hampered by ambiguous user queries, a significant safety risk that demonstrably reduces answer accuracy in high-stakes healthcare settings. In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), the irreducible uncertainty arising from underspecified input. To facilitate research in this direction, we construct CV-MedBench, the first benchmark designed for studying input ambiguity in Medical QA. Using this benchmark, we analyze AU from a representation engineering perspective, revealing that AU is linearly encoded in LLMs' internal activation patterns. Leveraging this insight, we introduce a novel AU-guided "Clarify-Before-Answer" framework, which incorporates AU-Probe, a lightweight module that detects input ambiguity directly from hidden states. Unlike existing uncertainty estimation methods, AU-Probe requires neither LLM fine-tuning nor multiple forward passes, enabling an efficient mechanism to proactively request user clarification and significantly enhance safety. Extensive experiments across four open LLMs demonstrate the effectiveness of our QA framework, with an average accuracy improvement of 9.48% over baselines. Our framework provides an efficient and robust solution for safe Medical QA, strengthening the reliability of health-related applications. The code is available at https://github.com/yaokunliu/AU-Med.git, and the CV-MedBench dataset is released on Hugging Face at https://huggingface.co/datasets/yaokunl/CV-MedBench.
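The "Clarify-Before-Answer" control flow described in the abstract can be sketched as a simple gate: if the probe's ambiguity score exceeds a threshold, the system requests clarification instead of answering. This is a hedged sketch of that routing logic only; the function names, the threshold value, and the stand-in LLM calls are illustrative assumptions, not the paper's actual implementation.

```python
# Assumed decision threshold on the probe's ambiguity score (illustrative).
AU_THRESHOLD = 0.5

def clarify_before_answer(query, au_score, answer_fn, clarify_fn):
    """Route a medical query: ask for clarification if the AU probe flags
    the input as ambiguous, otherwise answer directly."""
    if au_score > AU_THRESHOLD:
        return clarify_fn(query)    # proactively request clarification
    return answer_fn(query)         # query is well-specified; answer it

# Toy stand-ins for the underlying LLM calls:
answer = lambda q: f"ANSWER: {q}"
clarify = lambda q: f"CLARIFY: could you specify more details about '{q}'?"

print(clarify_before_answer("chest pain medication", 0.8, answer, clarify))
print(clarify_before_answer("maximum adult daily dose of ibuprofen", 0.1, answer, clarify))
```

The key efficiency point from the abstract is that `au_score` comes from a single forward pass over the model's hidden states, so this gate adds essentially no inference overhead compared with sampling-based uncertainty estimates.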
Problem

Research questions and friction points this paper is trying to address.

Ambiguity
Aleatoric Uncertainty
Medical Question Answering
Large Language Models
Safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aleatoric Uncertainty
Medical Question Answering
Uncertainty Quantification
Representation Engineering
Clarify-Before-Answer
Yaokun Liu
University of Illinois Urbana-Champaign
Yifan Liu
University of Illinois Urbana-Champaign
Phoebe Mbuvi
University of Illinois Urbana-Champaign
Zelin Li
University of Illinois Urbana-Champaign
Ruichen Yao
University of Illinois Urbana-Champaign
Gawon Lim
University of Illinois Urbana-Champaign
Dong Wang
Professor at University of Illinois Urbana-Champaign
Social Sensing, Human-centered AI, Social Intelligence, AI for Social Good, AI for Science