Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Date: 2026-03-28
Citations: 0
Influential: 0
AI Summary
This study addresses a critical bottleneck in developing general-purpose medical foundation models, namely the scarcity of large-scale, standardized, and high-quality medical imaging datasets. To tackle this challenge, the authors conduct a systematic survey of over 1,000 open-access medical imaging datasets and construct a comprehensive landscape encompassing multiple imaging modalities, clinical tasks, and anatomical regions. They propose a metadata-driven fusion paradigm (MDFP) to structurally integrate these fragmented resources. The work delivers an interactive data discovery portal, a unified dataset catalog, and a scalable, reusable structured repository, improving the discoverability and usability of medical imaging data and establishing a data foundation for future medical foundation model research.
Abstract
Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily owing to the availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, curating and assembling such datasets is highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
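The abstract describes MDFP only at a high level: datasets that share a modality or task are merged into larger resources. As a rough illustration of that idea, and not the authors' implementation, the sketch below groups metadata records by shared fields and totals their sizes. The `DatasetRecord` fields, the `fuse_by_metadata` helper, and the toy records are all hypothetical and are not taken from the survey.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata entry; field names are illustrative only."""
    name: str
    modality: str
    task: str
    num_images: int


def fuse_by_metadata(records, keys=("modality", "task")):
    """Group records that share the same values for `keys` and sum their sizes."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(getattr(rec, k) for k in keys)].append(rec)
    return {
        key: {
            "datasets": [r.name for r in recs],
            "total_images": sum(r.num_images for r in recs),
        }
        for key, recs in groups.items()
    }


# Made-up records for illustration; they do not correspond to real datasets.
catalog = [
    DatasetRecord("toy_ct_a", "CT", "segmentation", 120),
    DatasetRecord("toy_ct_b", "CT", "segmentation", 80),
    DatasetRecord("toy_xray_a", "X-ray", "classification", 500),
]

fused = fuse_by_metadata(catalog)
for key, info in fused.items():
    print(key, info)
```

Here the two small CT segmentation entries collapse into one group, while the X-ray entry stays on its own. A real pipeline would presumably also have to reconcile label vocabularies, file formats, and licenses, none of which this sketch attempts.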
Problem

Research questions and friction points this paper is trying to address.

medical imaging
foundation models
dataset scarcity
data fragmentation
open-access datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

metadata-driven fusion
medical foundation models
dataset integration
open-access medical imaging
interactive discovery portal
Zhongying Deng
University of Cambridge
Deep Learning, Multi-modal Learning, Computer Vision, Medical Image Analysis

Cheng Tang
Shanghai Artificial Intelligence Laboratory

Ziyan Huang
Shanghai Artificial Intelligence Laboratory

Jiashi Lin
Shanghai Artificial Intelligence Laboratory

Ying Chen
Shanghai Artificial Intelligence Laboratory

Junzhi Ning
Shanghai Artificial Intelligence Laboratory

Chenglong Ma
Fudan University; Shanghai Innovation Institute
multi-modal models, generative models, medical image analysis

Jiyao Liu
Shanghai Artificial Intelligence Laboratory

Wei Li
Shanghai Artificial Intelligence Laboratory

Yinghao Zhu
The University of Hong Kong
Data Mining, AI for Healthcare

Shujian Gao
Shanghai Artificial Intelligence Laboratory

Yanyan Huang
University of Hong Kong
Medical Image Analysis, Computer Vision, Computational Pathology

Sibo Ju
Fuzhou University

Yanzhou Su
FZU, UESTC
medical image analysis

Pengcheng Chen
Shanghai Artificial Intelligence Laboratory

Wenhao Tang
Shanghai Artificial Intelligence Laboratory

Tianbin Li
Shanghai Artificial Intelligence Laboratory
Machine Learning, Computer Vision, General Intelligence

Haoyu Wang
Shanghai Artificial Intelligence Laboratory

Yuanfeng Ji
Stanford; HKU
Computer vision, Medical Image Analysis

Hui Sun
Shanghai Artificial Intelligence Laboratory

Shaobo Min
Tencent Data Platform
multi-modal understanding

Liang Peng
The University of Hong Kong
AI in Healthcare, Multimodal machine learning

Feilong Tang
Shanghai Artificial Intelligence Laboratory

Haochen Xue
Shanghai Artificial Intelligence Laboratory

Rulin Zhou
The Chinese University of Hong Kong Shenzhen Research Institute
Deep Learning, Medical Image Processing