MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine

📅 2025-08-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing medical multimodal language models (MLMs) show promise for clinical decision support but perform poorly on fundamental visual perception tasks, such as judging image orientation or identifying the contrast phase of a CT scan, which hinders real-world clinical adoption. Method: We introduce MedBLINK, the first medical visual perception benchmark oriented toward clinical deployment, comprising eight clinically relevant tasks, 1,605 medical images, and 1,429 multiple-choice questions, and use it in a multimodal question-answering framework to systematically assess 19 state-of-the-art MLMs. Contribution/Results: Human experts achieve 96.4% accuracy, while the best-performing model attains only 65%, revealing a substantial gap in foundational visual understanding. MedBLINK is the first benchmark to expose systematic limitations of MLMs in low-level medical visual comprehension, providing a reproducible, clinically grounded evaluation standard to guide model improvement and trustworthy deployment.

📝 Abstract
Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT-4o, Claude 3.5 Sonnet) and domain-specific (Med-Flamingo, LLaVA-Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.
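The benchmark scores models by standard multiple-choice accuracy over image-question pairs. The sketch below shows one way such an evaluation loop could look; the JSON record schema, the ask_model wrapper, and the answer-parsing rule are illustrative assumptions, not MedBLINK's actual data format or evaluation harness.

```python
# Minimal sketch of a MedBLINK-style multiple-choice evaluation loop.
# Assumptions (not from the paper): questions are stored as JSON records with
# "image_path", "question", "options", and "answer" fields, and ask_model wraps
# whatever multimodal API is under test. Accuracy is the fraction of questions
# where the predicted option letter matches the gold answer.
import json
from pathlib import Path


def ask_model(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a call to the MLM being evaluated.

    In practice this would send the image plus a prompt such as
    'Answer with a single letter (A-D).' and return the model's reply.
    """
    raise NotImplementedError("plug in the model API here")


def evaluate(benchmark_file: str) -> float:
    records = json.loads(Path(benchmark_file).read_text())
    correct = 0
    for rec in records:
        reply = ask_model(rec["image_path"], rec["question"], rec["options"])
        # Take the first A-D letter in the reply as the predicted choice.
        predicted = next((ch for ch in reply.upper() if ch in "ABCD"), None)
        if predicted == rec["answer"]:
            correct += 1
    return correct / len(records)


if __name__ == "__main__":
    acc = evaluate("medblink_questions.json")
    print(f"accuracy: {acc:.1%}")  # reported reference points: humans 96.4%, best MLM 65%
```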
Problem

Research questions and friction points this paper is trying to address.

Assessing MLMs' basic perceptual abilities in medical imaging
Identifying gaps in visual grounding for clinical adoption
Comparing model performance on routine perception tasks against human experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for perceptual abilities in MLMs
Evaluates 19 state-of-the-art multimodal models
Highlights visual grounding gaps in clinical AI