In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review

📅 2025-01-18

🤖 AI Summary

Medical imaging datasets commonly suffer from label noise, shortcut learning, and missing metadata. Issues discovered after publication (e.g., biases, artifacts) are also difficult to address retrospectively, which undermines model robustness and clinical reliability. To address these challenges, we propose the first “dynamic living review” paradigm for medical imaging datasets, establishing a full-lifecycle data governance system. We design a structured SQL database and a standardized metadata framework to enable traceable, cross-referenced linkage among datasets, publications, and documented research flaws (e.g., biases, annotation errors, shortcut effects). Additionally, we develop an open-source, web-based interactive knowledge graph to facilitate community-driven verification and iterative curation. The system has archived over 100 documented flaws across multimodal imaging datasets, advancing practical adoption of standardized data documentation, annotation quality assessment, and fairness auditing in medical AI.

📝 Abstract

Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. Existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, and these reviews remain static: they are published once and not updated thereafter. As a result, they fail to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings about datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifacts and datasets. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://130.226.140.142.
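
To make the idea of an SQL database linking research artifacts to datasets more concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not the authors' schema: the table names (datasets, papers, artifacts), the column names, the helper functions, and all sample rows are hypothetical placeholders chosen only for illustration.

```python
import sqlite3

# Hypothetical schema: each artifact row records that a paper reported
# a finding of a given kind (e.g. a shortcut) about a dataset.
SCHEMA = """
CREATE TABLE datasets (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    modality TEXT
);
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE artifacts (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES datasets(id),
    paper_id INTEGER NOT NULL REFERENCES papers(id),
    kind TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO datasets (id, name, modality) VALUES (?, ?, ?)",
        (1, "ExampleChestXray", "X-ray"),
    )
    conn.executemany(
        "INSERT INTO papers (id, title) VALUES (?, ?)",
        [(1, "Placeholder paper A"), (2, "Placeholder paper B")],
    )
    conn.executemany(
        "INSERT INTO artifacts (dataset_id, paper_id, kind) VALUES (?, ?, ?)",
        [(1, 1, "shortcut"), (1, 2, "additional annotations")],
    )
    conn.commit()
    return conn


def artifacts_for(conn, dataset_name):
    """Return (kind, paper title) pairs recorded for one dataset."""
    query = """
        SELECT a.kind, p.title
        FROM artifacts AS a
        JOIN datasets AS d ON d.id = a.dataset_id
        JOIN papers AS p ON p.id = a.paper_id
        WHERE d.name = ?
        ORDER BY a.id
    """
    return conn.execute(query, (dataset_name,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for kind, title in artifacts_for(conn, "ExampleChestXray"):
        print(f"{kind}: {title}")
```

Under this assumed layout, a newly reported bias or annotation set becomes one more row in the artifacts table, so the dataset record itself never needs to be rewritten when new evidence appears.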

Problem

Research questions and friction points this paper is trying to address.

Medical Image Analysis

Dataset Quality

Algorithm Performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Review Methodology

Medical Imaging Datasets

Data Management Best Practices
