Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates three vision-language foundation models—RAD-DINO (self-supervised), CheXagent (text-supervised), and BiomedCLIP—on pneumothorax and cardiomegaly tasks in chest X-rays, assessing their performance across classification, segmentation, and regression. Method: We analyze how pretraining paradigms influence task-specific efficacy, revealing that RAD-DINO excels in fine-grained segmentation due to its text-free representation learning, whereas CheXagent achieves superior classification accuracy and interpretability via textual guidance. Leveraging these insights, we propose a lightweight, task-customized segmentation architecture integrating global and local features. Contribution/Results: Our architecture boosts mean Intersection-over-Union (mIoU) by 12.3% on average across all baseline models, notably improving segmentation of challenging cases such as pneumothorax. This work provides the first empirical, multi-task, multi-paradigm guideline for selecting radiology AI models and uncovers a principled correspondence between pretraining paradigms and the granularity of downstream medical imaging tasks.
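The headline result above is reported in mean Intersection-over-Union (mIoU). As a minimal sketch of how that metric is computed over binary segmentation masks (function names and the toy masks are illustrative, not from the paper):

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: conventionally score as perfect
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets) -> float:
    """mIoU: IoU averaged over a set of prediction/ground-truth pairs."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))

# Toy example: one overlapping pixel, two pixels in the union
p = np.array([[1, 1], [0, 0]])
t = np.array([[1, 0], [0, 0]])
print(iou(p, t))  # 0.5
```

A 12.3% mIoU gain, as reported, would move a model scoring 0.50 on this metric to roughly 0.62 if taken as absolute points; the paper's exact convention (absolute vs. relative) is not stated in this summary.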

📝 Abstract
Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs for fine-grained radiology imaging tasks
Comparing performance of RAD-DINO, CheXagent, BiomedCLIP models
Assessing impact of pre-training on classification and segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised RAD-DINO excels in segmentation tasks
Text-supervised CheXagent leads in classification performance
Custom segmentation model integrates global and local features
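The custom segmentation head's internals are not specified in this summary; the sketch below illustrates only the general idea of fusing a global (image-level) embedding with local (patch-level) features before a pointwise projection. All shapes, names, and the random inputs are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fuse_global_local(local_feats, global_feat, weight):
    """Fuse a global embedding with a local feature map, then project.

    local_feats: (C_l, H, W) patch-level features (e.g. ViT patch tokens)
    global_feat: (C_g,)      image-level embedding (e.g. a CLS token)
    weight:      (C_out, C_l + C_g) pointwise (1x1-conv-style) projection
    """
    c_l, h, w = local_feats.shape
    # Broadcast the global vector to every spatial location
    g = np.broadcast_to(global_feat[:, None, None],
                        (global_feat.shape[0], h, w))
    fused = np.concatenate([local_feats, g], axis=0)  # (C_l + C_g, H, W)
    # Project each spatial position to segmentation logits
    return np.einsum("oc,chw->ohw", weight, fused)    # (C_out, H, W)

rng = np.random.default_rng(0)
local = rng.standard_normal((8, 14, 14))   # hypothetical patch features
glob = rng.standard_normal(4)              # hypothetical global embedding
w = rng.standard_normal((1, 12))           # 8 local + 4 global channels
logits = fuse_global_local(local, glob, w)
print(logits.shape)  # (1, 14, 14)
```

The design intuition matches the paper's framing: local features carry the fine-grained detail needed for boundaries (where RAD-DINO excels), while the global embedding supplies image-level context.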
Frank Li
Department of Radiology, Emory University, Atlanta, GA, USA
Hari Trivedi
Emory University
Deep Learning, Radiology, Mammography, AI, Natural Language Processing
Bardia Khosravi
Radiology Resident @ Yale
Radiology, Artificial Intelligence, Imaging Informatics
Theo Dapamede
Emory University
Artificial Intelligence, Radiology, Imaging Informatics, Photon Counting CT
Mohammadreza Chavoshi
MD, Postdoctoral Researcher, Emory University
Radiology, Meta-analysis, Artificial Intelligence
Abdulhameed Dere
Faculty of Clinical Sciences, College of Health Sciences, University of Ilorin, Ilorin, Nigeria
Rohan Satya Isaac
Emory University
Healthcare, Radiology, AI
Aawez Mansuri
Department of Radiology, Emory University, Atlanta, GA, USA
Janice Newsome
Department of Radiology, Emory University, Atlanta, GA, USA
Saptarshi Purkayastha
Indiana University Indianapolis
Global Health, EHR, Imaging Informatics, mHealth, Information Infrastructure
Judy Gichoya
Department of Radiology, Emory University, Atlanta, GA, USA