MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

📅 2026-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical reliability gap in medical vision-language models (VLMs), which often generate plausible yet erroneous diagnoses when presented with inputs exhibiting modality, anatomical, or viewpoint inconsistencies, revealing a lack of basic input validity verification. To tackle this, the authors introduce MedObvious, a novel benchmark that formalizes pre-clinical “sanity checks” as a set-level consistency validation task, comprising 1,880 multi-image panel assessments organized into five progressive difficulty tiers. Through systematic evaluation of 17 prominent VLMs across five interaction formats—including multiple-choice and open-ended questions—the study uncovers pervasive weaknesses: models frequently hallucinate on negative samples, exhibit performance degradation as image-set size increases, and show high sensitivity to response format. These findings underscore significant reliability deficiencies in current VLMs for safety-critical clinical applications.

📝 Abstract
Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
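The paper does not spell out its scoring protocol here, but the set-level task it describes (flag whether any panel in a small image set violates expected coherence, including negative-control sets with no violation) can be sketched as follows. This is a minimal illustration under assumed metadata fields (`modality`, `anatomy`, `viewpoint`); the function and field names are hypothetical, not MedObvious's actual interface.

```python
def find_inconsistent_panels(panels, expected):
    """Return indices of panels whose metadata violates the expected
    set-level coherence (e.g. wrong modality or anatomy).

    panels:   list of dicts, one per image panel (hypothetical metadata)
    expected: dict of attribute -> expected value for a coherent set
    """
    return [
        i for i, panel in enumerate(panels)
        if any(panel.get(key) != value for key, value in expected.items())
    ]


def score_prediction(predicted, actual):
    """Exact-match scoring: correct only if the model flags exactly the
    violating panels -- so flagging anything on a clean negative-control
    set (a hallucinated anomaly) counts as an error."""
    return set(predicted) == set(actual)


# One panel with a mismatched modality in an otherwise coherent chest set.
panels = [
    {"modality": "X-ray", "anatomy": "chest", "viewpoint": "PA"},
    {"modality": "MRI",   "anatomy": "chest", "viewpoint": "PA"},
]
expected = {"modality": "X-ray", "anatomy": "chest"}
violations = find_inconsistent_panels(panels, expected)  # -> [1]
```

Under this exact-match view, the failure modes the paper reports map directly onto scoring: hallucinating on negatives is `score_prediction([1], [])`, and missing a violation is `score_prediction([], [1])`, both incorrect.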
Problem

Research questions and friction points this paper is trying to address.

Medical Vision-Language Models
Input Validation
Clinical Triage
Moravec's Paradox
Visual Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Vision-Language Models
Input Validation
Clinical Triage
Moravec's Paradox
Consistency Checking