A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation

📅 2025-10-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
In medical image segmentation guided by text, conventional data augmentation (e.g., rotation, flipping) disrupts cross-modal spatial alignment between images and text, degrading performance. To address this, we propose an early-fusion framework that projects textual embeddings into the visual space *before* augmentation and synthesizes semantically consistent, interpretable pseudo-images via a lightweight generator—thereby bridging the modality gap while preserving spatial coherence. Our method is architecture-agnostic, requiring no modifications to downstream segmentation models. Evaluated across three medical segmentation tasks and four state-of-the-art segmentation frameworks, it achieves new SOTA results. Visualizations confirm that the generated pseudo-images accurately localize target anatomical regions. The core innovations are: (1) the “pre-augmentation multimodal fusion” paradigm, and (2) a text-driven mechanism for generating interpretable, semantically grounded pseudo-images.
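The misalignment described above can be illustrated with a toy example. This is a minimal sketch, not the paper's code: the 8x8 mask, the referring sentence, and the helper `side_of_mass` are all hypothetical, and "left" and "right" are assumed to refer to image coordinates.

```python
import numpy as np

def side_of_mass(mask):
    """Return 'left' or 'right' for the image half holding more foreground pixels."""
    _, width = mask.shape
    left = mask[:, : width // 2].sum()
    right = mask[:, width // 2 :].sum()
    return "left" if left > right else "right"

# Hypothetical sample: a small target in the left half plus a referring sentence.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 1:3] = 1
text = "lesion in the left region"

# A horizontal flip moves the target, but the sentence is left unchanged.
flipped_mask = np.fliplr(mask)

original_side = side_of_mass(mask)
flipped_side = side_of_mass(flipped_mask)

# The text agrees with the original sample and contradicts the flipped one.
text_matches_original = original_side in text
text_matches_flipped = flipped_side in text
```

After the flip the target sits in the right half while the sentence still says "left", which is the kind of inconsistency the summary attributes to augmenting the image independently of the text.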

📝 Abstract
Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
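The fuse-then-augment ordering can be sketched as follows. Everything here is an assumption made for illustration: the linear-plus-sigmoid `text_to_pseudo_image` stands in for the lightweight generator, the blend weight `alpha` and the additive fusion in `early_fuse` are hypothetical, and the embedding is random. The released repository should be consulted for the actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_to_pseudo_image(text_emb, weight, height, width):
    """Hypothetical generator: linear map from a text embedding to an H x W map in (0, 1)."""
    logits = weight @ text_emb
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs.reshape(height, width)

def early_fuse(image, pseudo_image, alpha=0.5):
    """Assumed fusion operator: a convex blend that keeps the image shape unchanged."""
    return (1.0 - alpha) * image + alpha * pseudo_image

def augment(fused, mask):
    """Apply the same horizontal flip to the fused input and its mask."""
    return np.fliplr(fused).copy(), np.fliplr(mask).copy()

H, W, D = 8, 8, 16
image = rng.random((H, W))
mask = np.zeros((H, W))
mask[2:6, 1:3] = 1
text_emb = rng.standard_normal(D)            # placeholder for a real text encoder output
weight = 0.1 * rng.standard_normal((H * W, D))

pseudo = text_to_pseudo_image(text_emb, weight, H, W)
fused = early_fuse(image, pseudo)            # fusion happens first
aug_fused, aug_mask = augment(fused, mask)   # augmentation happens second
```

Because the text has already been turned into a spatial map, the flip moves the text-derived signal together with the image and the mask. With this shape-preserving blend the segmentation model receives an input of the original size, which is one way the "no modifications to downstream models" property could be met.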

Problem

Research questions and friction points this paper is trying to address.

Addressing the failure of conventional augmentations (rotation, flipping) in text-guided medical image segmentation
Preserving text-image spatial consistency during multimodal learning
Bridging semantic gaps between text and visual features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Early fusion framework preserves spatial text-image consistency
Lightweight generator projects text embeddings into visual space
State-of-the-art results on three medical imaging tasks across four segmentation frameworks