AI Summary
This work addresses the suboptimal transfer performance of vision foundation models on downstream tasks, often stemming from a mismatch between pretraining objectives and task-specific requirements. Focusing on prostate multiparametric MRI analysis, the study compares a reconstruction-based foundation model (ProFound) with a contrastive learning-based counterpart (ProViCNet). It proposes using lightweight metrics, such as Maximum Mean Discrepancy (MMD), to quantitatively assess feature alignment between pretraining and downstream task representations. Empirical results demonstrate that higher alignment correlates strongly with improved transfer performance and faster convergence. These findings offer both theoretical insight and practical guidance for designing pretraining objectives tailored to specific downstream applications.
Abstract
Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization across a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies such as masked image reconstruction or contrastive learning shape representations for recovering generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as Maximum Mean Discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
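The abstract proposes MMD between features before and after fine-tuning as an alignment measure. The paper's exact estimator and kernel choice are not given here; the following is a minimal sketch of a standard biased RBF-kernel squared-MMD estimator over two feature matrices, with the function names, bandwidth parameter, and feature shapes all illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise RBF kernel between rows of x (n, d) and y (m, d)
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2(feats_pre, feats_post, gamma=1.0):
    """Biased estimate of squared MMD between two feature sets,
    e.g. the same layer's features before vs. after fine-tuning."""
    k_xx = rbf_kernel(feats_pre, feats_pre, gamma)
    k_yy = rbf_kernel(feats_post, feats_post, gamma)
    k_xy = rbf_kernel(feats_pre, feats_post, gamma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
```

A small MMD indicates the fine-tuned features stayed close to the pretrained ones (good alignment); a large MMD indicates the downstream task pulled the representation far from where pretraining left it. In practice the kernel bandwidth `gamma` is often set by a median-distance heuristic rather than fixed.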