Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography

📅 2024-09-26
🏛️ arXiv.org
🤖 AI Summary
To address the label scarcity, minute lesion regions, and severe class imbalance that hinder effective adaptation of CLIP models to mammography, this paper proposes MaMA, one of the first end-to-end CLIP pre-training frameworks tailored to mammographic imaging. Methodologically, MaMA introduces a multi-view supervised contrastive learning strategy coupled with a symmetric local alignment module; integrates medical-knowledge-enhanced parameter-efficient fine-tuning of a pre-trained large language model; and incorporates a high-resolution local attention mechanism in the image encoder. Evaluated on EMBED and RSNA-Mammo across classification, cross-modal retrieval, and zero-shot diagnosis tasks, MaMA consistently outperforms state-of-the-art baselines, at only 52% of the model size of the largest baseline, achieving both computational efficiency and clinical practicality.
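The multi-view supervised contrastive idea can be illustrated with a minimal sketch: views sharing a label (e.g. CC and MLO views of the same exam) are treated as positives, and each anchor's loss averages the log-probability over its positives. This is a generic supervised contrastive formulation for illustration only, not the paper's actual implementation; the function name and batch layout are assumptions.

```python
import numpy as np

def supcon_multiview_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss: all same-label
    samples (e.g. multiple views of one exam) act as positives.
    embeddings: (N, D); labels: (N,) integer group/exam ids."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature                  # (N, N) scaled cosine sims
    n = sim.shape[0]
    self_mask = np.eye(n, dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    sim = np.where(self_mask, -np.inf, sim)        # exclude self-pairs
    # log-softmax over each row; exp(-inf) = 0 drops self from the partition
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    contrib = np.where(pos, log_prob, 0.0)         # keep only positive pairs
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                              # anchors with >=1 positive
    return float(-(contrib[valid].sum(axis=1) / n_pos[valid]).mean())
```

Embeddings whose same-label views cluster together yield a lower loss than shuffled ones, which is the signal driving the views of one exam toward a shared representation.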

📝 Abstract
Contrastive Language-Image Pre-training (CLIP) demonstrates strong potential in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities underexplored. Here, we propose one of the first adaptations of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and class-wise imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline. The code is available at https://github.com/XYPB/MaMA
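The symmetric local alignment module described in the abstract matches fine-grained image patches with text tokens in both directions. A minimal sketch of one common way to realize such bidirectional patch-token alignment (attention-weighted similarity aggregation, averaged over both directions) is shown below; the function name, scoring scheme, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_local_alignment(patch_emb, token_emb, temperature=0.07):
    """Illustrative bidirectional local alignment score.
    patch_emb: (P, D) image patch features; token_emb: (T, D) text tokens.
    Each token softly attends over patches (and vice versa), and the
    attention-weighted cosine similarities are averaged symmetrically."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    sim = t @ p.T                                   # (T, P) cosine similarities
    # text -> image: each token aggregates its best-matching patches
    t2i = (softmax(sim / temperature, axis=1) * sim).sum(axis=1).mean()
    # image -> text: each patch aggregates its best-matching tokens
    i2t = (softmax(sim.T / temperature, axis=1) * sim.T).sum(axis=1).mean()
    return float(0.5 * (t2i + i2t))
```

Sharp one-to-one patch-token correspondences score near 1, while diffuse, uninformative matches score lower, so maximizing this score encourages the model to ground report phrases in small regions of interest.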
Problem

Research questions and friction points this paper is trying to address.

Adapting CLIP to mammography with limited labeled data
Handling high-resolution mammograms with small regions of interest
Addressing class imbalance in mammography image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized supervision framework for mammography
Symmetric local alignment for high-resolution images
Parameter-efficient fine-tuning for medical language models
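The parameter-efficient fine-tuning contribution can be illustrated with a LoRA-style adapter, a standard PEFT technique for frozen pre-trained language models: the base weight stays fixed and only a low-rank update is trained. This is a generic sketch under that assumption, not the paper's specific adapter design; the class name and hyperparameters are hypothetical.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (B @ A),
    scaled by alpha / rank, in the style of LoRA adapters."""
    def __init__(self, w, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                        # frozen, shape (out, in)
        self.a = rng.normal(0.0, 0.02, (rank, w.shape[1]))  # trainable down-proj
        self.b = np.zeros((w.shape[0], rank))               # trainable up-proj, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # zero-initialized B makes the adapted layer start as the base layer
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T
```

Only `a` and `b` receive gradients, so the trainable parameter count scales with `rank * (in + out)` rather than `in * out`, which is what makes adapting a large medical language model feasible with scarce mammography reports.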
Yuexi Du
PhD candidate @ Yale University
Computer Vision · Medical Image Analysis · Multi-modal Learning
John Onofrey
Department of Biomedical Engineering, Department of Radiology & Biomedical Imaging, Department of Urology, Yale University, New Haven, CT, USA
N. Dvornek
Department of Biomedical Engineering, Department of Radiology & Biomedical Imaging, Yale University, New Haven, CT, USA