🤖 AI Summary
This work addresses the high annotation cost of medical referring expression segmentation and the image–text misalignment that aggressive data augmentation causes in semi-supervised learning. The authors propose a teacher–student semi-supervised framework that jointly leverages limited labeled and abundant unlabeled data through a cross-modal alignment mechanism. Key innovations include T-PatchMix, which applies CutMix-style augmentation synchronously to images and their positional referring texts; PosAug, a position-aware text augmentation strategy; and ITCL, which constructs soft positive samples from positional pseudo-labels to enhance image–text contrastive learning. Evaluated on the QaTa-COV19 and MosMedData+ datasets, the proposed method consistently outperforms existing fully supervised and semi-supervised baselines across all labeling ratios.
📝 Abstract
Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referring segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips) and do not fully address cross-modal alignment under strong augmentation. CutMix, although highly effective in single-modal SSL, remains underexplored in multi-modal settings because it tends to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: (1) T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; (2) PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and (3) ITCL, a position-guided image-text contrastive learning module that leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.
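To make the idea of keeping text in step with patch mixing concrete, the following is a minimal NumPy sketch. It is not the authors' T-PatchMix: the abstract does not spell out its position-constrained and probability-driven rules. The function names `cutmix_with_position` and `position_phrase`, the fixed box argument, and the left/right/bilateral vocabulary are all hypothetical, chosen only for illustration. The sketch pastes the same region into the image and the (pseudo-)mask, then re-derives the positional phrase from the mixed mask so that the text still describes what the image shows.

```python
import numpy as np


def cutmix_with_position(img_a, mask_a, img_b, mask_b, box):
    """Paste the region `box` of sample B into sample A (image and mask alike).

    `box` is (y0, y1, x0, x1) in pixel coordinates. Hypothetical helper,
    not the paper's T-PatchMix rule.
    """
    y0, y1, x0, x1 = box
    mixed_img = img_a.copy()
    mixed_mask = mask_a.copy()
    mixed_img[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    mixed_mask[y0:y1, x0:x1] = mask_b[y0:y1, x0:x1]
    return mixed_img, mixed_mask


def position_phrase(mask):
    """Describe where the foreground lies, by image halves.

    Assumed vocabulary: "left", "right", "bilateral", "none".
    "left" here means the left half of the image array, not patient-left.
    """
    width = mask.shape[1]
    in_left = bool(mask[:, : width // 2].any())
    in_right = bool(mask[:, width // 2 :].any())
    if in_left and in_right:
        return "bilateral"
    if in_left:
        return "left"
    if in_right:
        return "right"
    return "none"


# Toy example: A has a lesion on the left, B has one on the right.
img_a = np.zeros((4, 4))
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[1, 0] = True

img_b = np.ones((4, 4))
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[1, 3] = True

# Replace the right half of A with the right half of B.
mixed_img, mixed_mask = cutmix_with_position(
    img_a, mask_a, img_b, mask_b, box=(0, 4, 2, 4)
)

# The text for the mixed sample is re-derived from the mixed mask,
# so it stays consistent with the mixed image.
mixed_text = position_phrase(mixed_mask)
```

Applying CutMix to the image alone would leave sample A's original phrase attached to an image that now also contains B's lesion. Deriving the phrase from the mixed mask is one simple way to avoid that mismatch, under the assumptions stated above.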