🤖 AI Summary
AI models for medical imaging often generalize poorly in real-world settings, with performance drops of up to 20%, raising concerns about reproducibility and clinical trustworthiness. To address this, we propose a novel "virtual imaging trial" paradigm that integrates multicenter real-world data with physics-based synthetic CT and chest X-ray (CXR) images. For the first time, this framework quantitatively disentangles the effects of disease severity, imaging modality (CT vs. CXR), and radiation dose on AI model generalizability. Results show that CT outperforms CXR and that disease extent strongly influences performance, whereas radiation dose has minimal impact. By going beyond purely empirical clinical evaluation, our approach establishes a controlled, reproducible, and interpretable protocol for quantifying AI robustness, providing a methodological foundation for standardized validation and clinical translation of radiology AI systems.
📝 Abstract
The credibility of artificial intelligence (AI) models in medical imaging is often challenged by reproducibility issues and obscured clinical insights, a reality highlighted during the COVID-19 pandemic by many reports of near-perfect AI models that all failed to generalize. To address these concerns, we propose a virtual imaging trial framework that employs a diverse collection of both clinical and simulated medical images. In this study, COVID-19 serves as a case example to reveal the intrinsic and extrinsic factors influencing AI performance. Our findings underscore a significant impact of dataset characteristics on AI efficacy. Even when models were trained on large, diverse clinical datasets with thousands of patients, AI performance dropped by up to 20% in generalization. Virtual imaging trials, however, offer a robust platform for objective assessment, revealing the relationships between patient- and physics-based factors and AI performance. For instance, disease extent markedly influenced AI efficacy and computed tomography (CT) outperformed chest radiography (CXR), while imaging dose had minimal impact. Using COVID-19 as a case study, this virtual imaging trial study verified that radiology AI models often suffer from a reproducibility crisis. Virtual imaging trials not only offered a solution for objective performance assessment but also yielded several clinical insights. This study points the way toward leveraging virtual imaging to improve the reliability, transparency, and clinical relevance of AI in medical imaging.