MedCutMix: A Data-Centric Approach to Improve Radiology Vision-Language Pre-training with Disease Awareness

2025-09-20
Citations: 0
Influential: 0
AI Summary
Medical vision-language pretraining (VLP) is hindered by the high annotation cost and privacy constraints associated with radiology image-text pairs, which lead to data scarcity, while existing augmentation strategies offer limited semantic diversity. To address this, we propose a disease-centric multimodal data augmentation framework: (1) sentence-level CutMix guided by diagnostic descriptions; (2) cross-modal semantic alignment via disease-aware attention; and (3) attentive manifold mix, an attention-guided interpolation in the visual feature latent space. The approach explicitly injects disease semantics into cross-modal augmentation, aiming to balance semantic consistency and pathological diversity. Evaluated on four downstream radiological diagnosis tasks, it outperforms existing VLP methods, demonstrating improved model generalization and clinical semantic understanding.

Abstract
Vision-Language Pre-training (VLP) is drawing increasing interest for its ability to minimize manual annotation requirements while enhancing semantic understanding in downstream tasks. However, its reliance on image-text datasets poses challenges due to privacy concerns and the high cost of obtaining paired annotations. Data augmentation emerges as a viable strategy to address this issue, yet existing methods often fall short of capturing the subtle and complex variations in medical data due to limited diversity. To this end, we propose MedCutMix, a novel multi-modal disease-centric data augmentation method. MedCutMix performs diagnostic sentence CutMix within medical reports and establishes cross-attention between the diagnostic sentence and the medical image to guide attentive manifold mix within the imaging modality. Our approach surpasses previous methods across four downstream radiology diagnosis datasets, highlighting its effectiveness in enhancing performance and generalizability in radiology VLP.
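
The text-side step described in the abstract, diagnostic sentence CutMix within medical reports, can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword list `DISEASE_TERMS` and the helpers `split_sentences`, `is_diagnostic`, and `sentence_cutmix` are hypothetical names introduced here, and matching diagnostic sentences by keyword is an assumption made only for illustration.

```python
import random
import re

# Hypothetical keyword list (assumption); the paper's disease vocabulary is not given here.
DISEASE_TERMS = ("pneumonia", "effusion", "cardiomegaly", "edema", "atelectasis")


def split_sentences(report):
    """Split a report into sentences at whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p.strip() for p in parts if p.strip()]


def is_diagnostic(sentence):
    """Treat a sentence as diagnostic if it mentions any term in DISEASE_TERMS."""
    lowered = sentence.lower()
    return any(term in lowered for term in DISEASE_TERMS)


def sentence_cutmix(report_a, report_b, rng=None):
    """Replace one diagnostic sentence of report_a with one taken from report_b.

    Returns the mixed report and the donor sentence, or the stripped original
    report and None when either report has no diagnostic sentence.
    """
    rng = rng or random.Random()
    sents_a = split_sentences(report_a)
    donors = [s for s in split_sentences(report_b) if is_diagnostic(s)]
    targets = [i for i, s in enumerate(sents_a) if is_diagnostic(s)]
    if not donors or not targets:
        return report_a.strip(), None
    donor = rng.choice(donors)
    sents_a[rng.choice(targets)] = donor
    return " ".join(sents_a), donor


report_a = "The heart size is normal. There is a small left pleural effusion. No pneumothorax."
report_b = "Lungs are well expanded. Findings are consistent with right lower lobe pneumonia."
mixed_report, donor_sentence = sentence_cutmix(report_a, report_b, random.Random(0))
```

The returned donor sentence is what would then be paired with the donor image when guiding the image-side mix.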
Problem

Research questions and friction points this paper is trying to address.

Addressing limited diversity in medical image-text datasets for VLP
Improving disease-aware data augmentation for radiology VLP tasks
Enhancing semantic understanding while minimizing manual annotation costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disease-centric multimodal data augmentation method
Diagnostic sentence CutMix within medical reports
Attentive manifold mix within the imaging modality, guided by cross-attention between the diagnostic sentence and the image
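
As a rough illustration of the image-side idea listed above, the sketch below computes attention of a sentence vector over patch features and blends the most-attended patches of one image into another in feature space. It is a sketch under stated assumptions, not the paper's method: the function names, the top-`k` patch selection, and the fixed mixing weight `lam` are all hypothetical choices made here.

```python
import math


def attention_over_patches(sentence_vec, patch_feats):
    """Softmax of scaled dot products between a sentence vector and each patch feature."""
    scale = math.sqrt(len(sentence_vec))
    scores = [sum(q * k for q, k in zip(sentence_vec, p)) / scale for p in patch_feats]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_manifold_mix(feats_a, feats_b, attn_b, k=1, lam=0.5):
    """Blend the k most-attended patches of B into A at the same positions (assumed scheme)."""
    chosen = sorted(range(len(attn_b)), key=lambda i: attn_b[i], reverse=True)[:k]
    mixed = [list(p) for p in feats_a]  # copy so feats_a is left untouched
    for i in chosen:
        mixed[i] = [(1 - lam) * a + lam * b for a, b in zip(feats_a[i], feats_b[i])]
    return mixed, sorted(chosen)


# Toy example: three patches with two feature dimensions each.
sentence_vec = [1.0, 0.0]
feats_a = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
feats_b = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
attn_b = attention_over_patches(sentence_vec, feats_b)
mixed_feats, mixed_positions = attentive_manifold_mix(feats_a, feats_b, attn_b, k=1, lam=0.5)
```

In this toy case the second patch of image B aligns best with the sentence vector, so only that position of A is interpolated; a real model would operate on learned encoder features rather than hand-written lists.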
Sinuo Wang
PhD Candidate, The University of Adelaide
Vision-Language Machine Learning
Yutong Xie
Mohamed bin Zayed University of Artificial Intelligence
Yuyuan Liu
University of Oxford
Qi Wu
The University of Adelaide