🤖 AI Summary
In text-guided medical image segmentation, conventional data augmentation (e.g., rotation, flipping) disrupts cross-modal spatial alignment between images and text, degrading performance. To address this, we propose an early-fusion framework that projects textual embeddings into the visual space *before* augmentation and synthesizes semantically consistent, interpretable pseudo-images via a lightweight generator—thereby bridging the modality gap while preserving spatial coherence. Our method is architecture-agnostic, requiring no modifications to downstream segmentation models. Evaluated on three medical segmentation tasks across four segmentation frameworks, it achieves new state-of-the-art results. Visualizations confirm that the generated pseudo-images accurately localize target anatomical regions. The core innovations are: (1) the "pre-augmentation multimodal fusion" paradigm, and (2) a text-driven mechanism for generating interpretable, semantically grounded pseudo-images.
📝 Abstract
Deep learning relies heavily on data augmentation to mitigate data scarcity, especially in medical imaging. Recent multimodal methods integrate text and images for segmentation, a task known as referring or text-guided image segmentation. However, common augmentations such as rotation and flipping disrupt the spatial alignment between image and text, weakening performance. To address this, we propose an early-fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into the visual space, bridging the semantic gap between modalities. Visualizations of the generated pseudo-images show accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
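The key property of early fusion — fusing the projected text embedding with the image *before* any spatial augmentation, so both undergo the identical transform — can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the random linear projection standing in for the lightweight generator, the embedding size, and the flip augmentation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 8   # toy image size
D = 16      # text-embedding dimension (assumed)

image = rng.random((H, W))   # single-channel image
text_emb = rng.random(D)     # embedding of the text prompt

# Stand-in for the lightweight generator: a linear projection from the
# text embedding to a pseudo-image of the same spatial size (in practice
# this would be a small learned network).
W_gen = rng.random((H * W, D))
pseudo_image = (W_gen @ text_emb).reshape(H, W)

# Early fusion: stack image and pseudo-image BEFORE augmentation.
fused = np.stack([image, pseudo_image])   # shape (2, H, W)

# Spatial augmentation (horizontal flip) applied to the fused tensor:
# both channels transform together, so cross-modal spatial alignment
# is preserved by construction.
augmented = fused[:, :, ::-1]

# The flipped pseudo-image still matches the flipped image pixel-for-pixel.
assert np.array_equal(augmented[0], image[:, ::-1])
assert np.array_equal(augmented[1], pseudo_image[:, ::-1])
```

Augmenting *after* fusion is what avoids the misalignment: if the flip were applied to the image alone while the text-derived features stayed fixed, the text would still describe the un-flipped anatomy.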