🤖 AI Summary
To address critical challenges in AI healthcare deployment, namely lack of clinical accountability, insufficient generalizability, and workflow misalignment, this study proposes the first clinical-accountability-oriented AI deployment framework, using age-related macular degeneration (AMD) diagnosis and grading as a use case. Methodologically, it integrates deep learning fine-tuning, human-AI collaborative clinical workflow design, multicenter prospective validation, and a continual learning paradigm grounded in AREDS2 data, complemented by an external generalizability assessment on a geographically distinct cohort from Singapore. Results demonstrate that AI assistance improved diagnostic performance for 23 of 24 clinicians, raising the mean F1-score by roughly 20% (from 37.71 to 45.52), with individual improvements exceeding 50% in some cases; diagnosis time fell for 17 of the 19 clinicians tracked, with savings of up to 40%; and the continually trained model reached an F1-score of 54 on the external Singapore cohort, a 29% relative gain over its baseline of 42.
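Both headline gains are relative improvements, not absolute percentage-point differences, which is easy to misread. A quick worked check using only the figures reported in the abstract below:

```python
# Worked check of the reported gains: both are relative (%) improvements,
# not percentage-point differences. All figures come from the abstract.
manual_f1, assisted_f1 = 37.71, 45.52
print(f"Clinician F1 gain: {100 * (assisted_f1 - manual_f1) / manual_f1:.1f}%")
# -> Clinician F1 gain: 20.7%  (an absolute difference of only 7.81 points)

ext_before, ext_after = 42, 54
print(f"Singapore F1 gain: {100 * (ext_after - ext_before) / ext_before:.1f}%")
# -> Singapore F1 gain: 28.6%  (an absolute difference of 12 points)
```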
📝 Abstract
Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnostic accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and severity classification. We designed and implemented an AI-assisted diagnostic workflow for AMD, comparing diagnostic performance with and without AI assistance among 24 clinicians from 12 institutions, using real patient data sampled from the Age-Related Eye Disease Study (AREDS). Additionally, we demonstrated continual enhancement of an existing AI model by incorporating approximately 40,000 additional medical images (the AREDS2 dataset). The improved model was then systematically evaluated using both the AREDS and AREDS2 test sets, as well as an external test set from Singapore. AI assistance markedly enhanced diagnostic and classification accuracy for 23 out of 24 clinicians, with the average F1-score increasing by roughly 20%, from 37.71 (Manual) to 45.52 (Manual + AI) (P < 0.0001), and exceeding a 50% improvement in some cases. In terms of efficiency, AI assistance reduced diagnostic times for 17 of the 19 clinicians tracked, with time savings of up to 40%. Furthermore, the model equipped with continual learning showed robust performance across three independent datasets, recording a 29% relative improvement that raised the F1-score from 42 to 54 on the Singapore population.
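The abstract does not detail how the existing model was continually enhanced with the AREDS2 images; the sketch below is only a minimal illustration of the general technique (fine-tuning a pretrained classifier on newly collected data), not the authors' actual pipeline. All names (`continue_training`, `areds2_loader`), the optimizer, and the hyperparameters are assumptions.

```python
# Minimal continual-learning sketch: fine-tune an existing AMD grading model
# on additional images. Hypothetical setup; the paper's architecture,
# optimizer, and schedule are not specified in this summary.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def continue_training(model: nn.Module,
                      areds2_loader: DataLoader,
                      epochs: int = 5,
                      lr: float = 1e-4) -> nn.Module:
    """Fine-tune a pretrained AMD severity classifier on new images."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).train()
    # A small learning rate adapts the model to the ~40,000 new AREDS2
    # images while limiting forgetting of what was learned on AREDS.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # severity grading as classification
    for _ in range(epochs):
        for images, labels in areds2_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```

After a step like this, the updated model would be re-evaluated on held-out AREDS, AREDS2, and external (e.g., Singapore) test sets, as the abstract describes.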