🤖 AI Summary
This study systematically evaluates three foundation models with distinct pretraining paradigms—RAD-DINO (self-supervised, image-only), CheXagent (text-supervised), and BiomedCLIP (contrastive vision-language)—on pneumothorax and cardiomegaly tasks in chest X-rays, assessing their performance across classification, segmentation, and regression.
Method: We analyze how pretraining paradigms influence task-specific efficacy, revealing that RAD-DINO excels in fine-grained segmentation due to its text-free representation learning, whereas CheXagent achieves superior classification accuracy and interpretability via textual guidance. Leveraging these insights, we propose a lightweight, task-customized segmentation architecture integrating global and local features.
Contribution/Results: Our architecture boosts mean Intersection-over-Union (mIoU) by 12.3% on average across all baseline models, notably improving segmentation of challenging cases such as pneumothorax. This work provides the first empirical, multi-task, multi-paradigm guideline for selecting radiology AI models and uncovers a principled correspondence between pretraining paradigms and the granularity of downstream medical imaging tasks.
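The summary describes a lightweight segmentation head that fuses global and local features from a frozen foundation-model encoder. A minimal sketch of that idea, assuming a ViT-style backbone that emits a global (CLS) embedding plus a grid of patch embeddings; the layer sizes, fusion strategy, and module names here are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class GlobalLocalSegHead(nn.Module):
    """Hypothetical lightweight head fusing a global image embedding with
    local patch embeddings from a frozen encoder (dimensions assumed)."""

    def __init__(self, embed_dim=768, hidden_dim=256, grid=14, out_size=224):
        super().__init__()
        self.grid = grid
        # 1x1 conv projects the concatenated [local ; global] features per patch.
        self.fuse = nn.Conv2d(2 * embed_dim, hidden_dim, kernel_size=1)
        self.decode = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 1, kernel_size=1),  # binary mask logit
            nn.Upsample(size=(out_size, out_size), mode="bilinear",
                        align_corners=False),
        )

    def forward(self, patch_tokens, cls_token):
        # patch_tokens: (B, grid*grid, D) local features from the frozen encoder
        # cls_token:    (B, D)            global image feature
        B, N, D = patch_tokens.shape
        local = patch_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        # Broadcast the global embedding onto every spatial position, then fuse.
        glob = cls_token[:, :, None, None].expand(-1, -1, self.grid, self.grid)
        x = self.fuse(torch.cat([local, glob], dim=1))
        return self.decode(x)  # (B, 1, out_size, out_size) mask logits
```

Broadcasting the global token onto the patch grid before decoding is one simple way to let coarse image context (useful for diffuse findings like cardiomegaly) steer pixel-level predictions (needed for thin pneumothorax boundaries); the paper's actual fusion mechanism may differ.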
📝 Abstract
Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three foundation models with different pretraining paradigms (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.