A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
In medical image segmentation guided by text, conventional data augmentation (e.g., rotation, flipping) disrupts cross-modal spatial alignment between images and text, degrading performance. To address this, we propose an early-fusion framework that projects textual embeddings into the visual space *before* augmentation and synthesizes semantically consistent, interpretable pseudo-images via a lightweight generator, thereby bridging the modality gap while preserving spatial coherence. Our method is architecture-agnostic, requiring no modifications to downstream segmentation models. Evaluated across three medical segmentation tasks and four segmentation frameworks, it achieves new state-of-the-art results. Visualizations confirm that the generated pseudo-images accurately localize target anatomical regions. The core innovations are: (1) the “pre-augmentation multimodal fusion” paradigm, and (2) a text-driven mechanism for generating interpretable, semantically grounded pseudo-images.
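To make the “pre-augmentation multimodal fusion” idea concrete, here is a minimal PyTorch-style sketch. The names (`TextToPseudoImage`, `fuse_then_augment`), the linear-plus-upsampling generator, and the additive fusion operator are illustrative assumptions, not the paper's exact design: a text embedding is projected into a spatial pseudo-image, fused with the image, and only then geometrically augmented, so both modalities transform together.

```python
# Minimal sketch of pre-augmentation fusion (assumed design; the paper's
# lightweight generator and fusion operator may differ in detail).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

class TextToPseudoImage(nn.Module):
    """Lightweight generator: projects a text embedding into a
    single-channel spatial map ("pseudo-image") in visual space."""
    def __init__(self, text_dim: int = 768, grid: int = 8, out_size: int = 224):
        super().__init__()
        self.grid = grid
        self.out_size = out_size
        self.proj = nn.Linear(text_dim, grid * grid)  # text -> coarse 2D grid

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:  # (B, text_dim)
        coarse = self.proj(text_emb).view(-1, 1, self.grid, self.grid)
        # Upsample the coarse grid to image resolution.
        return F.interpolate(coarse, size=(self.out_size, self.out_size),
                             mode="bilinear", align_corners=False)

def fuse_then_augment(image: torch.Tensor, pseudo: torch.Tensor) -> torch.Tensor:
    """Fuse first, augment second: one geometric transform is applied to the
    already-fused tensor, so image and text-derived map stay aligned."""
    # Additive fusion is an assumption here; it keeps the channel count,
    # so downstream segmentation models need no modification.
    fused = image + pseudo                  # (B,3,H,W) + (B,1,H,W) broadcasts
    if random.random() < 0.5:               # the augmentations that would
        fused = TF.hflip(fused)             # break alignment under late fusion
    return TF.rotate(fused, angle=random.uniform(-15.0, 15.0))
```

Because the transform is applied after fusion, any rotation or flip moves the text-derived map and the image identically, which is exactly the spatial consistency that late-fusion pipelines lose.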

📝 Abstract
Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
Problem

Research questions and friction points this paper is trying to address.

Addressing data augmentation limitations in medical image segmentation
Preserving text-image spatial consistency during multimodal learning (see the toy demo after this list)
Bridging semantic gaps between text and visual features
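The friction point is easy to reproduce. In the toy demo below (an assumed setup, not from the paper), a text cue is anchored at a fixed pixel; rotating only the image moves the target, so a cue computed before augmentation no longer points at it:

```python
# Toy demo: augmenting the image alone invalidates a text-side spatial cue.
import torch

img = torch.zeros(8, 8)
img[2, 2] = 1.0                       # target pixel the text refers to
cue = (2, 2)                          # location implied by the text

aug = torch.rot90(img, k=1)           # stands in for a rotation augmentation
tgt = divmod(aug.argmax().item(), 8)  # where the target actually landed
print(cue, tgt)                       # (2, 2) vs. (5, 2): the cue is stale
```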
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early fusion framework preserves spatial text-image consistency (end-to-end usage sketched below)
Lightweight generator projects text embeddings into visual space
Method achieves state-of-the-art results across medical tasks
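Because fusion happens before the data reaches the model, adoption is a small change to the input pipeline. Continuing the sketch after the AI summary above (`gen`, `fuse_then_augment`, and `unet` are all hypothetical names from that sketch):

```python
# End-to-end usage, reusing the earlier sketch. `unet` stands for any
# off-the-shelf segmenter; it needs no modification because additive
# fusion preserves the input channel count.
import torch

gen = TextToPseudoImage(text_dim=768)
img = torch.randn(2, 3, 224, 224)      # batch of images
txt = torch.randn(2, 768)              # e.g., frozen text-encoder embeddings

x = fuse_then_augment(img, gen(txt))   # fuse first, then augment
# mask = unet(x)                       # any downstream segmentation model
```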
👥 Authors

Shurong Chai
Ritsumeikan University
Computer vision

Rahul Kumar JAIN
Tiwaki Co., Ltd., Kusatsu, Japan

Rui Xu
School of Software, Dalian University of Technology, Dalian, China

Shaocong Mo
College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Ruibo Hou
UIUC
NLP, AI4Science

Shiyu Teng
Ph.D. Student, Ritsumeikan University
Depression detection, multimodal learning

Jiaqing Liu
Renmin University of China
Natural Language Processing, Deep Learning, Machine Learning, Finance

Lanfen Lin
College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Yen-Wei Chen
Ritsumeikan University
Image processing, pattern recognition, medical image analysis