🤖 AI Summary
This study systematically evaluates three foundation models with distinct pretraining paradigms—RAD-DINO (self-supervised, image-only), CheXagent (text-supervised), and BiomedCLIP (contrastive vision-language)—on pneumothorax and cardiomegaly tasks in chest X-rays, assessing their performance across classification, segmentation, and regression.
Method: We analyze how pretraining paradigms influence task-specific efficacy, revealing that RAD-DINO excels in fine-grained segmentation due to its text-free representation learning, whereas CheXagent achieves superior classification accuracy and interpretability via textual guidance. Leveraging these insights, we propose a lightweight, task-customized segmentation architecture integrating global and local features.
Contribution/Results: Our architecture boosts mean Intersection-over-Union (mIoU) by 12.3% on average across all baseline models, notably improving segmentation of challenging cases such as pneumothorax. This work provides the first empirical, multi-task, multi-paradigm guideline for selecting radiology AI models and uncovers a principled correspondence between pretraining paradigms and the granularity of downstream medical imaging tasks.
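The summary describes a lightweight segmentation head that fuses global and local features from a frozen foundation-model encoder. A minimal sketch of that idea, assuming a ViT-style backbone that emits a global (CLS) embedding plus a grid of patch embeddings; the layer sizes, fusion strategy, and module names here are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class GlobalLocalSegHead(nn.Module):
    """Hypothetical lightweight head fusing a global image embedding with
    local patch embeddings from a frozen encoder (dimensions assumed)."""

    def __init__(self, embed_dim=768, hidden_dim=256, grid=14, out_size=224):
        super().__init__()
        self.grid = grid
        # 1x1 conv projects the concatenated [local ; global] features per patch.
        self.fuse = nn.Conv2d(2 * embed_dim, hidden_dim, kernel_size=1)
        self.decode = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 1, kernel_size=1),  # binary mask logit
            nn.Upsample(size=(out_size, out_size), mode="bilinear",
                        align_corners=False),
        )

    def forward(self, patch_tokens, cls_token):
        # patch_tokens: (B, grid*grid, D) local features from the frozen encoder
        # cls_token:    (B, D)            global image feature
        B, N, D = patch_tokens.shape
        local = patch_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        # Broadcast the global embedding onto every spatial position, then fuse.
        glob = cls_token[:, :, None, None].expand(-1, -1, self.grid, self.grid)
        x = self.fuse(torch.cat([local, glob], dim=1))
        return self.decode(x)  # (B, 1, out_size, out_size) mask logits
```

Broadcasting the global token onto the patch grid before decoding is one simple way to let coarse image context (useful for diffuse findings like cardiomegaly) steer pixel-level predictions (needed for thin pneumothorax boundaries); the paper's actual fusion mechanism may differ.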
📝 Abstract
Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three foundation models with different pretraining paradigms (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.