🤖 AI Summary
Supervised pretraining for medical imaging is hindered by the scarcity of expert annotations, while readily available metadata—such as imaging modality and anatomical region—remain underused. ModAn-MulSupCon addresses this by encoding each image's modality and anatomy as a multi-hot vector and training with a Jaccard-weighted multi-label supervised contrastive loss, turning annotation-free metadata into a pretraining signal. A ResNet-18 backbone is pretrained on the miniRIN subset of RadImageNet and evaluated by fine-tuning and linear probing on three downstream tasks. With fine-tuning, ModAn-MulSupCon achieves the best AUC on ACL tear detection (0.964) and thyroid nodule malignancy classification (0.763), significantly outperforming the baselines, and ranks a close second on breast lesion malignancy; with a frozen encoder, SimCLR and ImageNet pretraining remain stronger. These results support metadata-driven supervised contrastive pretraining as a practical, scalable initialization for label-scarce clinical tasks where fine-tuning is feasible.
📝 Abstract
Background and objective: Expert annotations limit large-scale supervised pretraining in medical imaging, while ubiquitous metadata (modality, anatomical region) remain underused. We introduce ModAn-MulSupCon, a modality- and anatomy-aware multi-label supervised contrastive pretraining method that leverages such metadata to learn transferable representations.
Method: Each image's modality and anatomy are encoded as a multi-hot vector. A ResNet-18 encoder is pretrained on a mini subset of RadImageNet (miniRIN, 16,222 images) with a Jaccard-weighted multi-label supervised contrastive loss, and then evaluated by fine-tuning and linear probing on three binary classification tasks: ACL tear (knee MRI), lesion malignancy (breast ultrasound), and nodule malignancy (thyroid ultrasound).
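The core idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the label vocabularies below are hypothetical, and the exact form of the Jaccard weighting (here, Jaccard overlap used as soft positive-pair weights inside a standard SupCon-style log-softmax) and the temperature are assumptions.

```python
import numpy as np

# Hypothetical label vocabularies for illustration; the actual modality and
# anatomy sets come from RadImageNet metadata and may differ.
MODALITIES = ["CT", "MRI", "US"]
ANATOMIES = ["knee", "breast", "thyroid"]

def multi_hot(modality, anatomy):
    """Encode one image's metadata as a single multi-hot target vector."""
    v = np.zeros(len(MODALITIES) + len(ANATOMIES))
    v[MODALITIES.index(modality)] = 1.0
    v[len(MODALITIES) + ANATOMIES.index(anatomy)] = 1.0
    return v

def jaccard_matrix(Y):
    """Pairwise Jaccard similarity between rows of a multi-hot label matrix."""
    inter = Y @ Y.T
    union = Y.sum(1)[:, None] + Y.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def jaccard_weighted_supcon(Z, Y, tau=0.1):
    """One plausible form of a Jaccard-weighted multi-label SupCon loss.

    Z: (N, d) L2-normalized embeddings; Y: (N, k) multi-hot metadata labels.
    Each pair (i, j) counts as a positive in proportion to the Jaccard
    overlap of its label vectors, instead of a hard same-class indicator.
    """
    sim = (Z @ Z.T) / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    m = sim.max(axis=1, keepdims=True)                # stabilize log-softmax
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    W = jaccard_matrix(Y)
    np.fill_diagonal(W, 0.0)                          # no self-positives
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    lp = np.where(np.isfinite(log_prob), log_prob, 0.0)
    return -(W * lp).sum(axis=1).mean()
```

For example, an MRI knee image and an MRI breast image share one of three active labels (Jaccard 1/3), so they act as a partial positive pair, while an MRI knee and an ultrasound thyroid image share nothing and act as pure negatives.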
Result: With fine-tuning, ModAn-MulSupCon achieved the best AUC on MRNet-ACL (0.964) and Thyroid (0.763), surpassing all baselines ($p<0.05$), and ranked second on Breast (0.926) behind SimCLR (0.940; difference not significant). With the encoder frozen, SimCLR/ImageNet were superior, indicating that ModAn-MulSupCon's representations benefit most from task adaptation rather than out-of-the-box linear separability.
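The distinction between the two evaluation protocols is that linear probing trains only a classifier head on frozen encoder features, whereas fine-tuning also updates the encoder. A minimal sketch of the linear-probe side, assuming plain logistic regression trained by gradient descent on precomputed features (the paper's actual probe configuration is not specified here):

```python
import numpy as np

def linear_probe(F, y, lr=0.5, steps=500):
    """Linear probing: fit a logistic-regression head on frozen features.

    F: (N, d) features from a frozen pretrained encoder; y: (N,) binary
    labels. Only the linear head (w, b) is trained by full-batch gradient
    descent on binary cross-entropy; the encoder is never updated.
    """
    N, d = F.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(F @ w + b, -30.0, 30.0)   # logits, clipped for stability
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid predictions
        g = p - y                             # dBCE/dlogits
        w -= lr * (F.T @ g) / N
        b -= lr * g.mean()
    return w, b
```

Because the head is linear, this protocol measures how linearly separable the frozen features already are, which is exactly where SimCLR/ImageNet pretraining retained an edge in the reported results.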
Conclusion: Encoding readily available modality/anatomy metadata as multi-label targets provides a practical, scalable pretraining signal that improves downstream accuracy when fine-tuning is feasible. ModAn-MulSupCon is a strong initialization for label-scarce clinical settings, whereas SimCLR/ImageNet remain preferable for frozen-encoder deployments.