CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

📅 2026-05-01

🤖 AI Summary
Current evaluations of medical large language models (LLMs) predominantly rely on exam-style benchmarks that lack clinical ambiguity, failing to capture the uncertainty inherent in real-world diagnosis and treatment. This work proposes the CLEAR framework, which systematically perturbs key aspects of multiple-choice questions—including the number of answer options, the presence or absence of a correct answer, the inclusion of an explicit “I don’t know” (IDK) option, and semantic phrasing—to evaluate 17 LLMs across three established medical benchmarks. The study quantifies, for the first time, the “lack of humility” exhibited by LLMs under ambiguity: adding plausible distractors significantly impairs their ability to identify correct answers and reject incorrect ones; introducing an IDK option paradoxically increases erroneous selections; and larger models display more pronounced overconfidence.
📝 Abstract
Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR to three benchmarks across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and to abstain from incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection, such as "None of the Above", to uncertainty admission, such as "I don't know" (IDK). Notably, just including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
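To make the perturbations and the humility deficit more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `MCQ` class, the helper names, and the exact scoring rule (deficit taken as identification accuracy minus abstention accuracy) are assumptions made for illustration, based only on the abstract's description.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice item."""
    question: str
    options: List[str]
    answer: str  # the option that counts as correct


def add_abstention(item: MCQ, label: str = "I don't know") -> MCQ:
    """Append an abstention option; the original correct answer is unchanged."""
    return replace(item, options=item.options + [label])


def remove_ground_truth(item: MCQ, label: str = "None of the Above") -> MCQ:
    """Drop the correct option and add an abstention option.

    Assumption: once no listed option is correct, choosing the
    abstention label is scored as the right response.
    """
    kept = [o for o in item.options if o != item.answer]
    return replace(item, options=kept + [label], answer=label)


def accuracy(predictions: List[str], items: List[MCQ]) -> float:
    """Fraction of items where the prediction matches the scored answer."""
    return sum(p == it.answer for p, it in zip(predictions, items)) / len(items)


def humility_deficit(acc_identify: float, acc_abstain: float) -> float:
    """Assumed form of the gap: accuracy with a ground truth present
    minus accuracy at abstaining when the ground truth is removed."""
    return acc_identify - acc_abstain
```

Under these assumptions, a model that scores 0.8 on the original items but selects the abstention option on only half of the ground-truth-removed items would have a humility deficit of about 0.3. Swapping the `label` argument between "None of the Above" and "I don't know" is one way to vary the semantic framing of abstention.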
Problem

Research questions and friction points this paper is trying to address.

medical LLMs
evaluation benchmarks
ambiguity
uncertainty
reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLEAR framework
ambiguity in medical LLMs
humility deficit
abstention behavior
reliability evaluation
Kevin H. Guo
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Chao Yan
Instructor at DBMI, VUMC; CS PhD from Vanderbilt U
AI for medicine, Synthetic health data, Privacy, Fairness
Avinash Baidya
Intuit AI Research, Mountain View, CA, USA
Katherine Brown
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Xiang Gao
Intuit
deep learning
Juming Xiong
Vanderbilt University
deep learning, computer vision, medical image processing
Zhijun Yin
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Bradley A. Malin
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA