🤖 AI Summary
Current medical AI lacks standardized, multimodal (text–image) retrieval evaluation benchmarks. To address this gap, we introduce M3Retrieve—the first large-scale, multidisciplinary, and task-diverse medical multimodal retrieval benchmark. It spans five major medical domains and 16 specialties, comprising over 1.2 million text documents and 164,000 cross-modal queries, supporting four realistic clinical tasks: cross-modal question answering, retrieval, summarization, and alignment. Built from compliant, authorized data, M3Retrieve provides rigorously aligned text–image pairs and employs a unified evaluation protocol to systematically assess state-of-the-art multimodal models. Our evaluation reveals critical bottlenecks in domain expertise, cross-modal alignment fidelity, and scalability. All resources—including datasets, evaluation frameworks, and baseline implementations—are publicly released, establishing the first comprehensive open benchmark for medical multimodal retrieval evaluation.
📝 Abstract
With the increasing use of Retrieval-Augmented Generation (RAG), strong retrieval models have become more important than ever. In healthcare, multimodal retrieval models that combine information from both text and images offer major advantages for many downstream tasks such as question answering, cross-modal retrieval, and multimodal summarization, since medical data often includes both formats. However, there is currently no standard benchmark to evaluate how well these models perform in medical settings. To address this gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark. M3Retrieve spans 5 domains, 16 medical fields, and 4 distinct tasks, with over 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. We evaluate leading multimodal retrieval models on this benchmark to explore the challenges specific to different medical specialties and to understand their impact on retrieval performance. By releasing M3Retrieve, we aim to enable systematic evaluation, foster model innovation, and accelerate research toward building more capable and reliable multimodal retrieval systems for medical applications. The dataset and the baseline code are available on GitHub at https://github.com/AkashGhosh/M3Retrieve.