🤖 AI Summary
In text-guided medical image segmentation, conventional data augmentation (e.g., rotation, flipping) disrupts cross-modal spatial alignment between images and text, degrading performance. To address this, we propose an early-fusion framework that projects textual embeddings into the visual space *before* augmentation and synthesizes semantically consistent, interpretable pseudo-images via a lightweight generator—thereby bridging the modality gap while preserving spatial coherence. Our method is architecture-agnostic, requiring no modifications to downstream segmentation models. Evaluated on three medical segmentation tasks across four segmentation frameworks, it achieves new state-of-the-art results. Visualizations confirm that the generated pseudo-images accurately localize target anatomical regions. The core innovations are: (1) the "pre-augmentation multimodal fusion" paradigm, and (2) a text-driven mechanism for generating interpretable, semantically grounded pseudo-images.
📝 Abstract
Deep learning relies heavily on data augmentation to mitigate data scarcity, especially in medical imaging. Recent multimodal methods integrate text and images for segmentation, a task known as referring or text-guided image segmentation. However, common augmentations such as rotation and flipping disrupt the spatial alignment between image and text, weakening performance. To address this, we propose an early-fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into the visual space, bridging the semantic gap between modalities. Visualizations of the generated pseudo-images show accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
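The key property of early fusion — fusing the projected text embedding with the image *before* any spatial augmentation, so both undergo the identical transform — can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the random linear projection standing in for the lightweight generator, the embedding size, and the flip augmentation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 8   # toy image size
D = 16      # text-embedding dimension (assumed)

image = rng.random((H, W))   # single-channel image
text_emb = rng.random(D)     # embedding of the text prompt

# Stand-in for the lightweight generator: a linear projection from the
# text embedding to a pseudo-image of the same spatial size (in practice
# this would be a small learned network).
W_gen = rng.random((H * W, D))
pseudo_image = (W_gen @ text_emb).reshape(H, W)

# Early fusion: stack image and pseudo-image BEFORE augmentation.
fused = np.stack([image, pseudo_image])   # shape (2, H, W)

# Spatial augmentation (horizontal flip) applied to the fused tensor:
# both channels transform together, so cross-modal spatial alignment
# is preserved by construction.
augmented = fused[:, :, ::-1]

# The flipped pseudo-image still matches the flipped image pixel-for-pixel.
assert np.array_equal(augmented[0], image[:, ::-1])
assert np.array_equal(augmented[1], pseudo_image[:, ::-1])
```

Augmenting *after* fusion is what avoids the misalignment: if the flip were applied to the image alone while the text-derived features stayed fixed, the text would still describe the un-flipped anatomy.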