🤖 AI Summary
This work addresses a limitation of existing self-supervised pretraining methods for medical imaging: they neglect the stable spatial topological relationships that anatomical structures maintain across individuals, a key physiological prior. To leverage this prior, the study introduces inter-individual anatomical topological consistency into 3D multimodal self-supervised learning for the first time and proposes a dual-alignment mechanism. When pixel-wise correspondences are available, a cross-modal triplet loss preserves local neighborhood topology; when they are not, pseudo-correspondences are constructed to control partial neighborhood alignment and prevent topological collapse across modalities. The approach goes beyond conventional instance-level contrastive learning, yielding average gains of 1.1% in segmentation and 5.94% in classification across seven downstream tasks and significantly better robustness when modalities are missing at test time.
📝 Abstract
Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They rarely exploit a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge is that medical images vary considerably across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach on 7 downstream multi-modal tasks, achieving average improvements of 1.1% in segmentation and 5.94% in classification, and demonstrating significantly better robustness when modalities are missing at test time.
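To make the intra-instance regime concrete, the following is a minimal sketch of what a cross-modal triplet objective over local neighborhoods could look like. It is not the authors' implementation: the abstract does not specify how triplets are sampled, so the function names, the `radius` threshold that separates positives from negatives, the Euclidean distance, and the hinge form with `margin` are all assumptions made for illustration. The anchor is a voxel embedding from one modality; positives and negatives are embeddings from the other modality at spatially nearby and spatially distant voxels, respectively.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss: pull the positive in, push the negative out."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def cross_modal_triplet_loss(feats_a, feats_b, coords, radius=1.5, margin=1.0):
    """Hypothetical cross-modal neighborhood objective (illustrative only).

    feats_a[i] and feats_b[i] are embeddings of the same voxel i in two
    co-registered modalities; coords[i] is that voxel's spatial position.
    Anchor: voxel i in modality A.
    Positives: modality-B voxels within `radius` of i (assumed rule).
    Negatives: modality-B voxels farther than `radius` from i (assumed rule).
    Returns the mean hinge loss over all such triplets.
    """
    total, count = 0.0, 0
    n = len(coords)
    for i in range(n):
        near = [j for j in range(n) if euclidean(coords[i], coords[j]) <= radius]
        far = [j for j in range(n) if euclidean(coords[i], coords[j]) > radius]
        for p in near:
            for q in far:
                total += triplet_loss(feats_a[i], feats_b[p], feats_b[q], margin)
                count += 1
    return total / count if count else 0.0


# Toy example: four voxels on a line, forming two spatial clusters.
coords = [(0.0,), (1.0,), (10.0,), (11.0,)]
feats_a = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]

# Modality B mirrors the spatial layout, so neighborhoods are preserved.
preserved_b = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
# Modality B maps every voxel to the same point: a collapsed topology.
collapsed_b = [(0.0, 0.0)] * 4

loss_preserved = cross_modal_triplet_loss(feats_a, preserved_b, coords)
loss_collapsed = cross_modal_triplet_loss(feats_a, collapsed_b, coords)
print(loss_preserved, loss_collapsed)
```

In this toy setting the loss is zero when the second modality's embeddings respect the spatial neighborhoods and equals the margin when they collapse to a single point, which is the kind of collapse the objective is meant to penalize. A practical version would sample triplets from 3D feature maps in a deep learning framework instead of enumerating all pairs.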