🤖 AI Summary
This study addresses a limitation of current medical vision-language models (VLMs): they often produce fluent yet erroneous responses when the visual evidence is degraded or the question premise is manipulated, which reveals an inability to assess evidential reliability. The authors introduce medvigil, a 300-case evaluation suite built under end-to-end supervision by board-certified radiologists, which pairs vision-required questions with counterfactual variants: false premises, wording perturbations, knowledge-only rewrites, and ROI-corrupted images. Each case carries a clinician-authored clinical risk tier, a refusal option, and region-of-interest (ROI) boxes, with two attending radiologists annotating in parallel and a senior radiologist consolidating the result. The release includes paired multiple-choice and open-ended questions, and seven correctness-conditioned audit metrics are summarised into the medvigil Composite Score (MCS). An independent radiologist achieves an MCS of 83.3, while the best-performing audited model, Claude Opus 4.7, attains 69.2, a 14.1-point gap in trustworthy medical reasoning. The benchmark and evaluation harness are publicly released.
📝 Abstract
Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce medvigil, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the medvigil Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
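The abstract does not spell out how the silent-failure rate is computed, so the following is only a minimal sketch of one plausible reading, not the released harness. It assumes that every probe passed in is one where refusal is the expected behaviour, and that "correctness-conditioned" means restricting to probes whose intact version the model answered correctly. The `ProbeResult` record and both function names are hypothetical. The second function just reproduces the headroom arithmetic stated in the abstract.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one perturbed probe (hypothetical record layout)."""
    correct_on_intact: bool  # model answered the unperturbed version correctly
    refused: bool            # model chose the refusal option on the perturbed version


def silent_failure_rate(results: Iterable[ProbeResult]) -> float:
    """Fraction of eligible probes answered without refusing.

    Assumption: a probe is eligible only if the model was correct on the
    intact version, and refusal is the expected answer for every probe given.
    """
    eligible = [r for r in results if r.correct_on_intact]
    if not eligible:
        return 0.0
    silent = sum(1 for r in eligible if not r.refused)
    return silent / len(eligible)


def composite_headroom(human_mcs: float, model_mcs: float) -> float:
    """Gap between the human reference score and a model score, to one decimal."""
    return round(human_mcs - model_mcs, 1)


if __name__ == "__main__":
    demo = [
        ProbeResult(correct_on_intact=True, refused=False),   # silent failure
        ProbeResult(correct_on_intact=True, refused=True),    # correct refusal
        ProbeResult(correct_on_intact=False, refused=False),  # not eligible
        ProbeResult(correct_on_intact=True, refused=True),    # correct refusal
    ]
    print(silent_failure_rate(demo))
    print(composite_headroom(83.3, 69.2))
```

Conditioning on intact-image correctness keeps the rate from being inflated by probes the model could not answer in the first place; whether the benchmark applies exactly this filter is an assumption here.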