🤖 AI Summary
This study systematically evaluates the transferability of nine self-supervised learning (SSL) methods for 3D medical image segmentation, with a particular focus on label-scarce scenarios. Using a unified CT pretraining protocol and the SwinUNETR architecture, the authors fine-tune models on nine diverse CT and MRI segmentation tasks and analyze convergence speed, cross-modality transferability, and feature-reuse patterns using Dice scores and centered kernel alignment (CKA). This large-scale, first-of-its-kind comparative study finds that SMIT achieves the best overall accuracy, the fastest convergence, and the most stable few-shot performance. Furthermore, masked image modeling (MIM) and self-distillation consistently outperform contrastive learning and rotation prediction, underscoring the advantage of local representation learning under data-limited conditions.
📝 Abstract
Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch on the same 10,412 3D CT scans (1.89 M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using the Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment (CKA).
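For readers unfamiliar with the metric, the Dice similarity coefficient for a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch, not the authors' evaluation code: the function name `dice_coefficient` is hypothetical, masks are assumed to be binary and flattened to 1D sequences, and returning 1.0 when both masks are empty is one common convention rather than something stated in the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 (or False/True) values of equal
    length; a 3D volume would be flattened before being passed in.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Usage: one of the two predicted foreground voxels overlaps the reference.
example_score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

For multi-class segmentation, the score is typically computed per structure and then averaged; how the paper aggregates across structures is not specified in this abstract.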
Results: The self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. The MIM-based SimMIM and the self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
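The feature-reuse comparison relies on centered kernel alignment, which scores the similarity of two sets of activations computed on the same inputs. The sketch below assumes the linear variant of CKA, since the abstract does not say which kernel was used. The function names are hypothetical, and each feature matrix is a list of rows with one row per input sample.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    cols_a = list(zip(*a))
    cols_b = list(zip(*b))
    return sum(
        sum(x * y for x, y in zip(ca, cb)) ** 2
        for ca in cols_a
        for cb in cols_b
    )


def linear_cka(x, y):
    """Linear CKA between feature matrices x (n x p) and y (n x q).

    Rows must correspond to the same n inputs in both matrices.
    The result lies in [0, 1]; 1 means identical up to rotation and scale.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )
    if denominator == 0:
        raise ValueError("features are constant; CKA is undefined")
    return numerator / denominator


# Usage: compare a layer's activations from two hypothetical fine-tuned models.
few_shot_feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
many_shot_feats = [[2.0, 1.0], [0.0, 5.0], [1.0, 1.0]]
similarity = linear_cka(few_shot_feats, many_shot_feats)
```

Under this reading, a high CKA between the same layer of a few-shot and a many-shot fine-tuned model would suggest that the two models reuse similar features at that layer.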