In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review

📅 2025-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical imaging datasets commonly suffer from label noise, shortcut learning, missing metadata, and challenges in retrospectively addressing newly discovered issues (e.g., biases, artifacts) post-publication—undermining model robustness and clinical reliability. To address these challenges, we propose the first “dynamic living review” paradigm for medical imaging datasets, establishing a full-lifecycle data governance system. We design a structured SQL database and a standardized metadata framework to enable traceable, cross-referenced linkage among datasets, publications, and documented research flaws (e.g., biases, annotation errors, shortcut effects). Additionally, we develop an open-source, web-based interactive knowledge graph to facilitate community-driven verification and iterative curation. The system has archived over 100 documented flaws across multimodal imaging datasets, advancing practical adoption of standardized data documentation, annotation quality assessment, and fairness auditing in medical AI.

📝 Abstract
Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. Existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, and these reviews remain static: they are published once and not updated thereafter. As a result, they fail to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings about datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifacts and datasets. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://130.226.140.142.
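To make the idea of an SQL database linking datasets to their research artifacts more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the authors' schema: the table names (dataset, paper, artifact), the column names, and the example rows are all hypothetical, chosen only to illustrate how a research artifact could be recorded as a link between a dataset and the paper that reports it.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE paper (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        -- One row per research artifact: a finding about a dataset,
        -- reported in a paper (e.g. a shortcut or extra annotations).
        CREATE TABLE artifact (
            id         INTEGER PRIMARY KEY,
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            paper_id   INTEGER NOT NULL REFERENCES paper(id),
            kind       TEXT NOT NULL
        );
        """
    )
    # Placeholder rows; these are invented examples, not real records.
    conn.execute("INSERT INTO dataset (id, name) VALUES (1, 'example-xray-set')")
    conn.executemany(
        "INSERT INTO paper (id, title) VALUES (?, ?)",
        [(1, "Hypothetical shortcut study"), (2, "Hypothetical relabeling study")],
    )
    conn.executemany(
        "INSERT INTO artifact (dataset_id, paper_id, kind) VALUES (?, ?, ?)",
        [(1, 1, "shortcut"), (1, 2, "additional annotations")],
    )
    conn.commit()
    return conn


def artifacts_for(conn, dataset_name):
    """Return (kind, paper title) pairs recorded for one dataset."""
    return conn.execute(
        """
        SELECT a.kind, p.title
        FROM artifact AS a
        JOIN dataset AS d ON d.id = a.dataset_id
        JOIN paper   AS p ON p.id = a.paper_id
        WHERE d.name = ?
        ORDER BY a.id
        """,
        (dataset_name,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for kind, title in artifacts_for(conn, "example-xray-set"):
        print(f"{kind}: {title}")
```

Under this assumed layout, a newly reported finding becomes one more row in the artifact table, so the record for a dataset can keep growing after publication without touching the dataset entry itself.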
Problem

Research questions and friction points this paper is trying to address.

Medical Image Analysis
Dataset Quality
Algorithm Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Review Methodology
Medical Imaging Datasets
Data Management Best Practices
Amelia Jiménez-Sánchez
IT University of Copenhagen, Denmark
Natalia-Rozalia Avlona
Postdoctoral Researcher, University of Copenhagen
Medical AI, AI Regulation & Governance, Data Infrastructures, Sociotechnical Ethnography
Sarah de Boer
Radboud University Medical Center, The Netherlands
Víctor M. Campello
Universitat de Barcelona, Spain
Aasa Feragen
Professor, DTU Compute
Machine learning, medical imaging, geometric modelling
Enzo Ferrante
CONICET & Universidad de Buenos Aires
Medical Imaging, Machine Learning, Computer Vision, ML Fairness
Melanie Ganz
Department of Computer Science, University of Copenhagen
Medical Image Analysis
Judy Wawira Gichoya
Emory University
Health informatics, Radiology, Artificial Intelligence, Global Health, FAIR AI
Camila González
Stanford University, USA
Steff Groefsema
University of Groningen, The Netherlands
Alessa Hering
Radboud University Medical Center
Deep Learning, Image Registration, Tumor Follow-Up, LLM
Adam Hulman
Steno Diabetes Center Aarhus, Aarhus University Hospital, Denmark and Department of Public Health, Aarhus University, Denmark
Leo Joskowicz
Professor of Computer Science, The Hebrew University of Jerusalem, Israel.
Medical Image Processing, Computer Aided Surgery, Computational Geometry, Robotics
Dovile Juodelyte
PhD Fellow, IT University of Copenhagen
Data Science, Machine Learning, Transfer Learning, Medical Imaging
Melih Kandemir
Associate Professor of Machine Learning at the University of Southern Denmark
Bayesian Inference, Neural Stochastic Processes, Dynamics Modeling, Reinforcement Learning
Thijs Kooi
Lunit Inc.
Machine Learning, Medical Image Analysis, Computer aided diagnosis
Jorge del Pozo Lérida
IT University of Copenhagen & Cerebriu A/S, Denmark
Livie Yumeng Li
Steno Diabetes Center Aarhus, Aarhus University Hospital & Department of Public Health, Aarhus University, Denmark
Andre Pacheco
Federal University of Espírito Santo, Brazil
Tim Rädsch
Division of Intelligent Medical Systems, German Cancer Research Center, Germany; Helmholtz Imaging, German Cancer Research Center, Germany; and Engineering Faculty, Heidelberg University, Germany
Mauricio Reyes
Medical Image Analysis, ARTORG Center, Univ. Bern
Medical Image Analysis, Biomedical Engineering
Théo Sourget
IT University of Copenhagen, Denmark
Bram van Ginneken
Professor of Medical Image Analysis, Radboud University
Medical Image Analysis, Medical Imaging, Deep Learning, Computer-Aided Diagnosis
David Wen
Department of Dermatology, Churchill Hospital, Oxford University Hospitals, UK
Nina Weng
Technical University of Denmark, Denmark