M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical image retrieval methods rely on modality-specific architectures, hindering cross-modal unified representation learning and scalable deployment. To address this, the authors propose M3Ret, a zero-shot multimodal medical image retrieval framework that requires no paired data. M3Ret combines Masked Autoencoder (MAE) and SimDINO self-supervised learning to train a single visual encoder that jointly represents 2D, 3D, and video medical imagery. Its core finding is that strong cross-modal alignment emerges without any paired supervision, which substantially improves generalization to modalities never seen during pretraining (e.g., MRI). Evaluated on a large-scale, self-curated multimodal medical dataset, M3Ret outperforms strong baselines, including DINOv3 and BMC-CLIP, in zero-shot retrieval, establishing new state-of-the-art performance. It also scales favorably with both data volume and model size, supporting practical deployment across diverse clinical imaging scenarios.

📝 Abstract
Medical image retrieval is essential for clinical decision-making and translational research, relying on discriminative visual representations. Yet, current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks, despite never observing MRI during pretraining, demonstrating the generalizability of purely visual self-supervision to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
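Zero-shot image-to-image retrieval, as described in the abstract, reduces to nearest-neighbor search in the unified embedding space: every image (regardless of modality) is encoded once, and queries are ranked against the gallery by cosine similarity. A minimal NumPy sketch of that ranking step, in which the embeddings stand in for outputs of the M3Ret encoder (the encoder itself is not reproduced here):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=3):
    """Zero-shot image-to-image retrieval: rank gallery embeddings by
    cosine similarity to the query embedding (highest first)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity per gallery item
    order = np.argsort(-sims)[:top_k]  # indices of the top-k matches
    return order, sims[order]

# Toy example: 4 gallery embeddings of dim 8; the query is a slightly
# perturbed copy of gallery item 2, so item 2 should rank first.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(4, 8))
query = gallery[2] + 0.05 * rng.normal(size=8)
idx, scores = retrieve(query, gallery)
print(idx[0])  # → 2
```

Because the encoder is modality-agnostic, the same gallery can mix X-ray, ultrasound, endoscopy-frame, and CT-slice embeddings, which is what makes the cross-modal retrieval in the paper possible without paired data.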
Problem

Research questions and friction points this paper is trying to address.

Unifying fragmented architectures for 2D/3D/video medical image retrieval
Learning transferable representations without modality-specific customization
Enabling zero-shot cross-modal retrieval for unseen medical imaging tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual encoder without modality-specific customization
Combines generative and contrastive self-supervised learning paradigms
Learns transferable representations from hybrid-modality medical dataset
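The pairing of generative and contrastive paradigms in the bullets above can be illustrated schematically. The following NumPy sketch combines an MAE-style masked-reconstruction error with an InfoNCE-style contrastive term over two augmented views; the function names, the weighting `lam`, and the toy shapes are all illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def mae_loss(pred, target, mask):
    """Generative objective: MSE computed on masked patches only."""
    diff = (pred - target) ** 2
    return (diff * mask[:, None]).sum() / (mask.sum() * pred.shape[1])

def contrastive_loss(z1, z2, temperature=0.1):
    """Contrastive objective: InfoNCE matching each sample in view 1
    to its counterpart in view 2, against all other samples."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def joint_loss(pred, target, mask, z1, z2, lam=0.5):
    """Hybrid SSL objective: weighted sum of the two paradigms."""
    return mae_loss(pred, target, mask) + lam * contrastive_loss(z1, z2)

# Toy tensors: 4 patches of dim 6; 3 samples with embedding dim 5.
rng = np.random.default_rng(1)
pred, target = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
mask = np.array([1.0, 0.0, 1.0, 1.0])  # 1 = patch was masked
z1, z2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
print(joint_loss(pred, target, mask, z1, z2))
```

The design intuition, under this reading of the paper: the reconstruction term forces the encoder to retain fine-grained anatomical detail, while the contrastive term pulls semantically similar samples together in the shared space, which is the property retrieval depends on.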