🤖 AI Summary
Current 3D medical self-supervised learning (mSSL) methods predominantly partition images into fixed-size patches, overlooking the heterogeneity of anatomical structures in spatial location, scale, and morphology. The resulting semantic representations are coarse-grained and insufficiently discriminative. To address this, the authors propose $S^2DC$, a structure-aware framework that jointly optimizes semantic discrepancy and semantic consistency. The framework first uses an optimal transport strategy to push the representations of different patches apart (semantic discrepancy), then uses neighborhood similarity distributions to strengthen consistency among patches within the same structure (semantic consistency). Bridging patch-level and structure-level representations in this way yields structure-aware features. Evaluated across 10 datasets, 4 downstream tasks, and 3 medical imaging modalities, the method consistently outperforms state-of-the-art mSSL methods.
📝 Abstract
3D medical image self-supervised learning (mSSL) holds great promise for medical image analysis. Supporting broader applications effectively requires accounting for variations in the location, scale, and morphology of anatomical structures, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images into fixed-size patches and often ignore these structural variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency), while patches from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, which achieves Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches, increasing semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on the neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms state-of-the-art mSSL methods.
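The two steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how optimal transport or the neighborhood similarity distribution is formulated. As assumptions, the sketch uses entropy-regularized optimal transport (Sinkhorn iterations) to spread patch embeddings evenly over a hypothetical set of prototypes, and a softmax over pairwise cosine similarities as the neighborhood distribution. The names `sinkhorn_assign`, `neighborhood_distribution`, `epsilon`, and `temperature` are all illustrative.

```python
import numpy as np


def sinkhorn_assign(scores, epsilon=0.05, n_iters=50):
    """Soft, balanced assignment of N patches to K prototypes (assumed setup).

    scores: (N, K) similarity matrix, higher means more compatible.
    Returns an (N, K) matrix whose rows each sum to 1. Column sums approach
    N / K as the iterations converge, which discourages all patches from
    collapsing onto the same prototype.
    """
    n, k = scores.shape
    # Shift by the global max for numerical stability before exponentiating.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)  # normalize each column
        q /= k                             # give every prototype mass 1/K
        q /= q.sum(axis=1, keepdims=True)  # normalize each row
        q /= n                             # give every patch mass 1/N
    return q * n  # rescale so that each row is a probability distribution


def neighborhood_distribution(z, temperature=0.1):
    """Softmax over each patch's cosine similarity to every other patch."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    sim -= sim.max(axis=1, keepdims=True)  # stabilize the softmax
    p = np.exp(sim)
    return p / p.sum(axis=1, keepdims=True)
```

Under these assumptions, the balanced assignments from `sinkhorn_assign` could serve as targets that keep different patches distinguishable, while comparing `neighborhood_distribution` outputs across two augmented views of the same volume could encourage patches with similar neighborhoods to agree. How $S^2DC$ actually combines the two terms is described in the paper itself, not in this abstract.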