Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

📅 2026-03-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a critical bottleneck in developing general-purpose medical foundation models: the scarcity of large-scale, standardized, high-quality medical imaging datasets. The authors systematically survey more than 1,000 open-access medical imaging datasets and catalog them by imaging modality, clinical task, and anatomical region. They propose a metadata-driven fusion paradigm (MDFP) that integrates these fragmented resources by combining datasets with shared modalities or tasks. The work delivers an interactive data discovery portal and a unified, structured dataset catalog with reference links, improving the discoverability and reuse of medical imaging data as a foundation for future medical foundation model research.

📝 Abstract

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the availability of large-scale, diverse, and high-quality datasets. In medical imaging, however, curating and assembling such datasets is highly challenging because it relies on clinical expertise and is subject to strict ethical and privacy constraints. The result is a scarcity of large-scale unified medical datasets, which hinders the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and we compile all surveyed datasets into a unified, structured table that summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
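
The core idea of grouping datasets by shared metadata can be illustrated with a minimal Python sketch. This is not the authors' MDFP implementation, which the abstract does not specify. The `DatasetRecord` fields, the `fuse_by_metadata` helper, the normalization rule, and the toy catalog entries are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; field names are assumptions."""
    name: str
    modality: str
    task: str
    num_images: int


def fuse_by_metadata(records, keys=("modality", "task")):
    """Group records whose normalized metadata values match on `keys`."""
    groups = defaultdict(list)
    for rec in records:
        # Normalize case and whitespace so "CT" and " ct " share a group.
        group_key = tuple(getattr(rec, k).strip().lower() for k in keys)
        groups[group_key].append(rec)
    # Summarize each group as a candidate fused resource.
    return {
        key: {
            "datasets": [r.name for r in members],
            "total_images": sum(r.num_images for r in members),
        }
        for key, members in groups.items()
    }


# Toy catalog with made-up names and sizes (not from the survey).
catalog = [
    DatasetRecord("toy_ct_a", "CT", "segmentation", 120),
    DatasetRecord("toy_ct_b", "ct", "Segmentation", 200),
    DatasetRecord("toy_xray_c", "X-ray", "classification", 500),
]

fused = fuse_by_metadata(catalog)
for key, summary in fused.items():
    print(key, summary)
```

In this sketch the two CT segmentation entries collapse into one group while the X-ray entry stays separate. A real pipeline would also need to reconcile label vocabularies, image formats, and licenses before datasets could actually be merged.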

Problem

Research questions and friction points this paper is trying to address.

medical imaging

foundation models

dataset scarcity

data fragmentation

open-access datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

metadata-driven fusion

medical foundation models

dataset integration