🤖 AI Summary
Medical AI models for CT analysis often suffer from domain shift caused by variations in CT scanners, reconstruction algorithms, and radiation dose protocols, which limits cross-center generalizability. To address this, we introduce an open-source, multi-center CT benchmark dataset built on an anthropomorphic phantom, with data from 13 scanners from 4 manufacturers across 8 institutions. Because the same phantom is scanned throughout, anatomy is held fixed and inter- and intra-patient variation is removed as a confounder. The dataset comprises 1,378 CT image series acquired with a harmonized protocol and at several dose levels. We further provide a methodology and open-source code for assessing harmonization performance at both the image and feature levels. Baseline results are reported for image- and feature-level stability and for liver tissue classification. This work offers a reproducible, standardized testing platform for developing AI harmonization techniques and advancing robustness research in medical AI.
📝 Abstract
Artificial intelligence (AI) has introduced numerous opportunities for human assistance and task automation in medicine. However, it generalizes poorly when the data distribution shifts. In AI-based computed tomography (CT) analysis, significant distribution shifts can be caused by changes in scanner manufacturer, reconstruction technique, or dose. AI harmonization techniques can address this problem by reducing the distribution shifts caused by differing acquisition settings. This paper presents an open-source benchmark dataset of CT scans of an anthropomorphic phantom acquired with various scanners and settings, whose purpose is to foster the development of AI harmonization techniques. Using a phantom removes the variation that would otherwise come from inter- and intra-patient differences. The dataset includes 1378 image series acquired with 13 scanners from 4 manufacturers across 8 institutions, using a harmonized protocol as well as several acquisition doses. Additionally, we present a methodology, baseline results, and open-source code to assess image- and feature-level stability and liver tissue classification, promoting the development of AI harmonization strategies.