🤖 AI Summary
The feature quality and adaptability of current medical foundation models (MFMs) have not been systematically evaluated for fine-grained thoracic X-ray analysis, particularly classification and anatomical structure segmentation.
Method: We systematically benchmark eight vision foundation models—spanning medical versus general pretraining, multi-scale versus modality-aligned architectures, and text-guided versus image-supervised alignment—using linear probing, full fine-tuning, and subgroup analysis on standard radiology datasets.
Results: Medical pretraining substantially improves linear probing performance but fails to eliminate the need for fine-tuning in subtle lesion segmentation. Text-image alignment is unnecessary; label-supervised or purely image-based pretraining yields superior segmentation accuracy. Multi-scale architectural design proves more decisive than cross-modal alignment. Critically, we reveal an intrinsic limitation of state-of-the-art MFMs in complex spatial localization tasks; meanwhile, supervised end-to-end models now match or surpass leading foundation models in segmentation precision—challenging prevailing assumptions about the necessity of foundation-model paradigms for medical imaging segmentation.
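The linear-probing protocol above trains only a lightweight head on top of a frozen encoder, so performance directly reflects the quality of the pretrained embeddings. A minimal sketch of that setup follows; the `LinearProbe` class, the toy stand-in encoder, and the 768-dimensional embedding size are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of linear probing: freeze the pretrained encoder, train only a
# linear head, so downstream accuracy measures raw embedding quality.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen vision encoder plus a trainable linear classification head."""
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # freeze: gradients flow only to the head
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(x)  # (B, embed_dim) pooled embedding
        return self.head(feats)

# Toy stand-in encoder (assumption): any FM backbone returning a pooled
# embedding (e.g. a ViT CLS token) would slot in here instead.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
probe = LinearProbe(encoder, embed_dim=768, num_classes=2)

x = torch.randn(4, 3, 224, 224)       # batch of 4 fake chest X-rays
logits = probe(x)
print(logits.shape)                   # torch.Size([4, 2])
trainable = sum(p.numel() for p in probe.parameters() if p.requires_grad)
print(trainable)                      # only the head: 768 * 2 + 2 = 1538
```

Full fine-tuning differs only in skipping the freeze loop, letting gradients update the encoder as well, which is what the subtle-lesion segmentation results show is still required.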
📝 Abstract
Foundation models (FMs) promise broad generalization across medical imaging tasks, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and for segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without substantial fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed that FMs exploit confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among the top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale design) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.
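The segmentation benchmarks above are typically scored and trained with overlap-based objectives. As a concrete reference point, here is a minimal soft Dice loss, the standard choice for binary mask segmentation such as pneumothorax or cardiac-boundary delineation; the abstract does not specify the paper's exact loss, so treat this as a generic sketch rather than the authors' implementation.

```python
# Soft Dice loss for binary segmentation masks: 1 - Dice overlap,
# averaged over the batch. pred holds probabilities in [0, 1];
# target holds binary ground-truth masks of the same shape (B, 1, H, W).
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    # eps keeps the ratio well-defined for empty masks (no lesion present)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

pred = torch.sigmoid(torch.randn(2, 1, 64, 64))          # fake predictions
target = (torch.rand(2, 1, 64, 64) > 0.5).float()        # fake masks
loss = dice_loss(pred, target)                           # value in [0, 1]
perfect = dice_loss(target, target)                      # ~0 for exact match
```

Because Dice rewards only true spatial overlap, the classification shortcuts noted in the abstract (e.g., detecting chest tubes instead of the pneumothorax itself) earn no credit under this objective, which is consistent with FMs scoring well on classification yet poorly on subtle-lesion segmentation.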