🤖 AI Summary
This work systematically evaluates how well pretrained CNNs (e.g., ResNet) and vision foundation models (ViTs, DINOv2, SAM, CLIP) transfer as feature extractors for content-based medical image retrieval (CBMIR) in cross-modal, few-shot settings, and analyzes the impact of input image resolution. We propose a feature normalization and metric fusion strategy specific to medical images. To our knowledge, this is the first unified benchmark of foundation models, including SAM and DINOv2, for CBMIR. Experiments show that DINOv2, without fine-tuning, achieves R@10 = 72.3% on the RSNA CXR dataset, substantially outperforming conventional CNNs; it also enables effective cross-domain retrieval and accelerates inference by 3.2×. t-SNE visualization and linear probing further confirm its robustness and strong transferability.
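For readers unfamiliar with the R@10 metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed from frozen feature vectors. The summary does not state the exact protocol, so the details here are assumptions: features are L2-normalized, similarity is cosine, and a query counts as a hit if any of its top-K retrieved gallery images shares its label. The function names `l2_normalize` and `recall_at_k` are hypothetical and not taken from the work itself.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=10):
    """Fraction of queries with at least one same-label item in the top-k results.

    Assumption: the hit criterion is label equality; the original evaluation
    may use a different relevance definition.
    """
    q = l2_normalize(np.asarray(query_feats, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float64))
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)

    sims = q @ g.T                            # (num_queries, num_gallery) cosine similarities
    k = min(k, g.shape[0])                    # cannot retrieve more items than the gallery holds
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar gallery items
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())
```

In practice, `query_feats` and `gallery_feats` would be the embeddings produced by a frozen extractor such as a CNN or DINOv2 backbone, so the same function can compare models without any fine-tuning step.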