🤖 AI Summary
This study addresses "premature closure" in frontier large language models (LLMs) in clinical settings: models commit to an answer despite insufficient information, which can contribute to diagnostic error. The authors define LLM premature closure as inappropriate commitment under uncertainty and quantify it, evaluating whether a model recognizes when it should clarify, abstain, escalate, or refuse instead of answering. They evaluate five frontier LLMs on structured tasks (MedQA, AfriMed-QA) and open-ended tasks (HealthBench and physician-authored adversarial queries), with and without safety-oriented prompting. When the correct choice is removed from multiple-choice questions, models still select an answer at baseline rates of 55-81% on MedQA and 53-82% on AfriMed-QA. In open-ended evaluation, models answer inappropriately on an average of 30% of HealthBench questions and 78% of adversarial queries. Safety-oriented prompting reduces premature closure across models but does not eliminate it.
📝 Abstract
Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
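As an illustration of the structured-task metric, the following is a minimal sketch of how a false-action rate could be computed when the correct choice has been removed from every question. It assumes each response is recorded either as a selected option or as an abstention; the `Response` type, the `false_action_rate` function, and the toy data are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    """One model response to a question whose correct option was removed."""
    question_id: str
    # Option letter the model chose, or None if it abstained or refused.
    # (Hypothetical encoding; the study's own format is not specified here.)
    selected: Optional[str]


def false_action_rate(responses: List[Response]) -> float:
    """Fraction of responses that commit to an option although none is correct."""
    if not responses:
        raise ValueError("no responses to score")
    committed = sum(1 for r in responses if r.selected is not None)
    return committed / len(responses)


# Toy, made-up responses for illustration only (not data from the study).
toy = [
    Response("q1", "B"),
    Response("q2", None),
    Response("q3", "A"),
    Response("q4", "D"),
]
print(f"{false_action_rate(toy):.0%}")  # prints "75%": 3 of 4 responses committed
```

Under this encoding, any selected option counts as a false action because no remaining option is correct, so only abstention, clarification, escalation, or refusal lowers the rate.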