🤖 AI Summary
Current evaluations of medical large language models (LLMs) rely predominantly on exam-style benchmarks that lack clinical ambiguity, so they fail to capture the uncertainty inherent in real-world diagnosis and treatment. This work proposes the CLEAR framework, which systematically perturbs key aspects of multiple-choice questions (the number of answer options, the presence or absence of a correct answer, the inclusion of an explicit “I don’t know” (IDK) option, and the semantic phrasing of options) and applies these perturbations to three established medical benchmarks evaluated across 17 LLMs. The study quantifies, for the first time, the “lack of humility” that LLMs exhibit under ambiguity: adding plausible distractors impairs their ability both to identify correct answers and to abstain from incorrect ones; introducing an IDK option paradoxically increases incorrect selections; and the gap between identifying correct answers and abstaining from incorrect ones widens as model scale grows.
📝 Abstract
Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR to three benchmarks across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and to abstain from incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection, such as "None of the Above", to an admission of uncertainty, such as "I don't know" (IDK). Notably, merely including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
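The perturbations and the humility deficit described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Item` class, the `perturb`, `accuracy`, and `humility_deficit` functions, and their parameters are hypothetical names introduced here, and the deficit is assumed to be a simple difference of two accuracies, since the abstract does not give the exact formula.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Item:
    """One multiple-choice question (hypothetical container)."""
    question: str
    options: List[str]  # all answer options, including the ground truth
    answer: str         # text of the ground-truth option


def perturb(item: Item, n_distractors: int, keep_truth: bool,
            abstain_label: Optional[str], seed: int = 0
            ) -> Tuple[List[str], Optional[str]]:
    """Build one perturbed variant and return (options, target).

    n_distractors : how many incorrect options to show
    keep_truth    : whether the ground-truth option stays in the list
    abstain_label : e.g. "None of the Above" or "I don't know"; None omits it
    """
    rng = random.Random(seed)
    distractors = [o for o in item.options if o != item.answer]
    chosen = rng.sample(distractors, min(n_distractors, len(distractors)))
    options = chosen + ([item.answer] if keep_truth else [])
    rng.shuffle(options)
    if abstain_label is not None:
        options.append(abstain_label)  # abstention option is listed last
    # Assumption: once the truth is removed, abstaining is the expected choice.
    target = item.answer if keep_truth else abstain_label
    return options, target


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match their targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def humility_deficit(acc_identify: float, acc_abstain: float) -> float:
    """Assumed form: accuracy when the truth is present minus
    accuracy when the truth is removed and abstention is expected."""
    return acc_identify - acc_abstain
```

In such a setup, each question would be rendered twice per abstention framing, once with `keep_truth=True` and once with `keep_truth=False`, and the two resulting accuracies would be passed to `humility_deficit`. A larger positive value would indicate a model that picks correct answers more reliably than it declines incorrect ones.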