Multimodal Medical Image Binding via Shared Text Embeddings

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal medical imaging suffers from misalignment across modalities due to the absence of paired data between arbitrary modality pairs. Method: This paper proposes M³Bind, the first framework to achieve joint multimodal alignment without explicit inter-modal pairing by leveraging text as a shared semantic mediator. Built upon the CLIP architecture, it introduces modality-specific textual space fine-tuning and knowledge distillation to construct a unified, shared text encoder, preserving each modality's original image-text alignment capability while enabling zero-shot and few-shot cross-modal retrieval and classification. Results: Extensive evaluation across X-ray, CT, fundus, ECG, and histopathology images demonstrates state-of-the-art performance across multiple tasks, with significant improvements in zero-shot classification accuracy and cross-modal retrieval recall.

📝 Abstract
Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variants have enabled image-text alignment, they require explicitly paired data between any two modalities, which is difficult to acquire in medical contexts. To address this gap, we present Multimodal Medical Image Binding with Text (M³Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicitly paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M³Bind first fine-tunes pre-trained CLIP-like image-text models to align their modality-specific text embedding spaces while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Experiments on X-ray, CT, retina, ECG, and pathological images across multiple downstream tasks demonstrate that M³Bind achieves state-of-the-art performance in zero-shot and few-shot classification and cross-modal retrieval tasks compared to its CLIP-like counterparts. These results validate M³Bind's effectiveness in achieving cross-image-modal alignment for medical analysis.
Problem

Research questions and friction points this paper is trying to address.

Aligning diverse medical imaging modalities without paired data
Creating shared text embedding space for multimodal integration
Improving zero-shot and cross-modal retrieval in medical analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns medical images via shared text embeddings
Eliminates need for explicit paired modality data
Distills modality-specific encoders into unified model
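The distillation step listed above (merging modality-specific text encoders into one unified encoder) can be sketched, under toy assumptions, as fitting a single student map to mimic several teachers' embeddings of the same text inputs. The linear encoders and plain gradient descent below are stand-ins for illustration, not the paper's architecture or training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_text = 16, 8, 64  # toy dimensions (assumptions)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Teacher text encoders for three imaging modalities (toy linear stand-ins
# for the fine-tuned, modality-specific CLIP text towers).
teachers = [rng.normal(size=(d_in, d_out)) for _ in range(3)]
texts = rng.normal(size=(n_text, d_in))          # shared text inputs
targets = [l2_normalize(texts @ T) for T in teachers]

# Distill: fit one student matrix W so that texts @ W approximates every
# teacher's embedding, via gradient descent on the mean squared error.
W = rng.normal(size=(d_in, d_out)) * 0.01
lr = 0.05
for step in range(500):
    pred = texts @ W
    grad = np.zeros_like(W)
    for tgt in targets:
        grad += texts.T @ (pred - tgt) / n_text
    W -= lr * grad / len(targets)

# The student converges toward the teachers' consensus, giving a single
# shared text embedding space for all modalities.
mean_tgt = sum(targets) / len(targets)
print(float(np.mean((texts @ W - mean_tgt) ** 2)))
```

Because the teachers disagree, the student settles on their consensus embedding; in the actual framework this step is what produces the unified text encoder that all image modalities bind to.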
👥 Authors
Yunhao Liu
ACM Fellow, IEEE Fellow, CCF Fellow, Tsinghua University
Suyang Xi
Emory University, Atlanta, USA
Shiqi Liu
The University of Hong Kong, Hong Kong, China
Hong Ding
Tsung-Dao Lee Institute, Shanghai Jiao Tong University
Chicheng Jin
University of Science and Technology of China, Hefei, China
Chenxi Yang
University of Electronic Science and Technology of China, Chengdu, China
Junjun He
Shanghai Jiao Tong University
Yiqing Shen
Johns Hopkins