🤖 AI Summary
Multimodal medical imaging suffers from misalignment across modalities due to the absence of paired data between arbitrary modality pairs. Method: This paper proposes M³Bind, the first framework to achieve joint multimodal alignment without explicit inter-modal pairing by leveraging text as a shared semantic mediator. Built upon the CLIP architecture, it introduces modality-specific textual-space fine-tuning and knowledge distillation to construct a unified, shared text encoder—preserving each modality’s original image–text alignment capability while enabling zero-shot and few-shot cross-modal retrieval and classification. Results: Extensive evaluation across X-ray, CT, fundus, ECG, and histopathology images demonstrates state-of-the-art performance across multiple tasks, with significant improvements in zero-shot classification accuracy and cross-modal retrieval recall.
📝 Abstract
Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variants have enabled image-text alignment, they require explicitly paired data between any two modalities, which is difficult to acquire in medical contexts. To address this gap, we present Multimodal Medical Image Binding with Text (M³Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M³Bind first fine-tunes pre-trained CLIP-like image-text models to align their modality-specific text embedding spaces while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Experiments on X-ray, CT, retina, ECG, and pathological images across multiple downstream tasks demonstrate that M³Bind achieves state-of-the-art performance in zero-shot classification, few-shot classification, and cross-modal retrieval compared to its CLIP-like counterparts. These results validate M³Bind's effectiveness in achieving cross-image-modal alignment for medical analysis.
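The distillation step described above—merging several modality-specific text encoders into one shared encoder—can be illustrated with a deliberately simplified sketch. This is not the paper's actual architecture: the encoders here are stand-in linear maps rather than CLIP text transformers, and the distillation objective is a plain MSE fit of the student to the averaged teacher embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 200

# Hypothetical stand-ins for modality-specific text encoders
# (e.g. the X-ray, ECG, and pathology CLIP text towers): each is
# a linear map producing text embeddings in its own space.
teachers = [rng.normal(size=(d_out, d_in)) for _ in range(3)]

# Toy "text" inputs standing in for report/caption features.
X = rng.normal(size=(n, d_in))

# Distillation: fit one shared student encoder W so that X @ W matches
# the average teacher embedding on every input (an MSE objective).
targets = np.mean([X @ W.T for W in teachers], axis=0)   # (n, d_out)
W_student, *_ = np.linalg.lstsq(X, targets, rcond=None)  # (d_in, d_out)

# The student now maps any new text into the shared embedding space,
# approximating the teachers it was distilled from.
x_new = rng.normal(size=(d_in,))
student_emb = x_new @ W_student
teacher_mean = np.mean([W @ x_new for W in teachers], axis=0)
err = float(np.linalg.norm(student_emb - teacher_mean))
```

Because the toy teachers are linear and the fit is least-squares, the student recovers the mean teacher map almost exactly; in the real framework the student is a full text encoder trained so that images from every modality bind to one common text space.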