🤖 AI Summary
Medical image retrieval faces two key challenges: (1) the definition of "similarity" varies across clinical contexts, and (2) large-scale, high-quality benchmark datasets with fine-grained, anatomy-conditioned relevance annotations are scarce. To address these, we propose a scalable, multi-granularity medical image retrieval framework. Our method introduces the first automatic similarity annotation paradigm grounded in semantic parsing of radiology reports, enabling fine-grained image ranking conditioned on anatomical structures. We construct MIMIC-IR, the first large-scale X-ray retrieval dataset, and CTRATE-IR, the first large-scale CT retrieval dataset, both annotated with anatomy-conditioned relevance rankings. Furthermore, we design an anatomy-aware cross-modal alignment module and a multi-granularity contrastive learning architecture. Our RadIR-CXR and RadIR-ChestCT models achieve state-of-the-art performance on both image-to-image and image-to-report retrieval tasks, outperforming prior methods on 77 of 78 evaluation metrics and demonstrating substantial improvements in anatomy-conditioned retrieval accuracy.
📝 Abstract
Developing advanced medical imaging retrieval systems is challenging due to the varying definitions of "similar images" across different medical contexts. This challenge is compounded by the lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic manner. Using this approach, we construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for chest X-rays and CTRATE-IR for CT scans, providing detailed image-image ranking annotations conditioned on diverse anatomical structures. Furthermore, we develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance on traditional image-image and image-report retrieval tasks. These systems also enable flexible, effective image retrieval conditioned on specific anatomical structures described in text, achieving state-of-the-art results on 77 out of 78 metrics.
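To make the anatomy-conditioned annotation idea concrete, here is a minimal, hypothetical sketch of how relevance rankings could be derived once reports have been parsed into per-anatomy finding sets. The function names, the dictionary schema, and the Jaccard-overlap score are illustrative assumptions, not the paper's actual parsing or scoring pipeline:

```python
# Hypothetical sketch: rank candidate studies against a query study,
# conditioned on one anatomical structure. Assumes each study's report
# has already been parsed into {anatomy: [findings...]} dictionaries;
# the real report-parsing and scoring methods are not specified here.

def anatomy_conditioned_relevance(findings_a, findings_b, anatomy):
    """Jaccard overlap of findings for a single anatomical structure."""
    fa = set(findings_a.get(anatomy, []))
    fb = set(findings_b.get(anatomy, []))
    if not fa and not fb:
        return 1.0  # both studies normal for this anatomy: fully similar
    return len(fa & fb) / len(fa | fb)

def rank_candidates(query, candidates, anatomy):
    """Order candidates by anatomy-conditioned relevance to the query."""
    return sorted(
        candidates,
        key=lambda c: anatomy_conditioned_relevance(
            query["findings"], c["findings"], anatomy
        ),
        reverse=True,
    )

query = {"findings": {"lung": ["opacity", "effusion"], "heart": []}}
c1 = {"id": 1, "findings": {"lung": ["opacity"], "heart": ["cardiomegaly"]}}
c2 = {"id": 2, "findings": {"lung": [], "heart": []}}

# The same pair of studies ranks differently per anatomy: c1 shares a
# lung finding with the query, while c2 matches the query's normal heart.
lung_order = [c["id"] for c in rank_candidates(query, [c2, c1], "lung")]
heart_order = [c["id"] for c in rank_candidates(query, [c2, c1], "heart")]
```

Note how the ordering flips between `lung` and `heart` for the same image pair, which is precisely why a single global similarity score is insufficient and the datasets carry per-anatomy rankings.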