How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Ensuring the clinical reliability of vision-language models (VLMs) for zero-shot medical visual question answering (VQA) remains challenging. Method: We systematically evaluate open-source general-purpose and medical-domain-specific VLMs across eight medical VQA benchmarks (e.g., MedXpert, PMC-VQA, PathVQA), decomposing model competence into fine-grained "understanding" and "reasoning" dimensions to assess zero-shot transfer. Contribution/Results: While large-scale general VLMs match domain-specific models in visual-linguistic understanding, they exhibit significant deficits in clinical reasoning, a critical safety bottleneck for deployment. Performance varies markedly across benchmarks, primarily due to heterogeneity in task design, annotation quality, and domain-knowledge requirements. No evaluated model meets clinical reliability standards for real-world deployment. We identify two urgent needs: (1) multimodal alignment mechanisms that are robust to medical semantics, and (2) a refined, clinically grounded evaluation framework that captures task-specific reasoning demands. These findings underscore the insufficiency of current VLMs for safe clinical automation and call for targeted architectural and evaluative advances.

📝 Abstract
Vision-Language Models (VLMs) trained on web-scale corpora excel at natural-image tasks and are increasingly repurposed for healthcare; however, their competence on medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks, including MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To examine model capability along distinct axes, we decompose performance into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.
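The protocol the abstract describes, scoring zero-shot multiple-choice answers per benchmark and then aggregating separately by skill dimension, can be sketched as below. This is a minimal illustration, assuming each benchmark item carries a skill label ("understanding" or "reasoning") and that the model is wrapped in a hypothetical `answer_fn` returning one of the candidate options; neither the interface nor the field names come from the authors' code.

```python
# Sketch: per-benchmark accuracy, split into understanding vs reasoning.
# `VQAItem` fields and `answer_fn` are illustrative assumptions, not the paper's API.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class VQAItem:
    image_path: str        # medical image for the question
    question: str          # question text
    options: list[str]     # multiple-choice candidates
    answer: str            # gold option
    skill: str             # "understanding" or "reasoning" (assumed annotation)
    benchmark: str         # e.g. "SLAKE", "VQA-RAD"

def evaluate_zero_shot(
    items: Iterable[VQAItem],
    answer_fn: Callable[[str, str, list[str]], str],  # (image, question, options) -> option
) -> dict[str, dict[str, float]]:
    """Return accuracy per benchmark, keyed by skill dimension."""
    correct: dict[tuple[str, str], int] = defaultdict(int)
    total: dict[tuple[str, str], int] = defaultdict(int)
    for item in items:
        prediction = answer_fn(item.image_path, item.question, item.options)
        key = (item.benchmark, item.skill)
        total[key] += 1
        correct[key] += int(prediction.strip().lower() == item.answer.strip().lower())
    report: dict[str, dict[str, float]] = defaultdict(dict)
    for (benchmark, skill), n in total.items():
        report[benchmark][skill] = correct[(benchmark, skill)] / n
    return dict(report)
```

Under this sketch, the per-benchmark difference `report[b]["understanding"] - report[b]["reasoning"]` corresponds to the comparison the paper centers on; the consistently lower reasoning accuracy reported in the abstract is what flags reasoning, rather than perception, as the deployment bottleneck.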
Problem

Research questions and friction points this paper is trying to address.

Evaluate medical VLMs' performance on diverse benchmarks
Assess the gap between understanding and reasoning in medical VLMs
Identify reliability barriers for clinical VLM deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensively evaluating VLMs on medical benchmarks
Separating performance into understanding and reasoning components
Highlighting the need for stronger multimodal alignment