🤖 AI Summary
Critical gaps exist in empirical evidence regarding the performance of AI models for cardiac ultrasound across sex, race, and ethnicity subgroups. Method: This study systematically assessed the completeness of sociodemographic reporting across six publicly available echocardiography datasets, improved that reporting for two of them (TMED-2 and MIMIC-IV-ECHO), and conducted an exploratory subgroup performance evaluation of two published deep learning models for aortic stenosis detection on TMED-2. Results: The datasets exhibited underrepresentation of female participants, insufficient patient counts for many racial and ethnic minority groups, no consideration of gender-diverse patients, and frequent absence of sociodemographic annotations. The exploratory analysis found insufficient evidence of subgroup validity for sex, racial, and ethnic subgroups. Contribution: The study provides improved sociodemographic reporting for two open datasets and an empirical benchmark for pre-deployment subgroup-validity assessment in echocardiographic AI, highlighting the need for more data from underrepresented subgroups, better demographic reporting, and subgroup-focused analyses.
📝 Abstract
Echocardiogram datasets enable training deep learning models to automate interpretation of cardiac ultrasound, thereby expanding access to accurate readings of diagnostically useful images. However, the gender, sex, race, and ethnicity of the patients in these datasets are underreported, and subgroup-specific predictive performance goes unevaluated. These reporting deficiencies raise concerns about subgroup validity that must be studied and addressed before model deployment. In this paper, we show that current open echocardiogram datasets are unable to assuage subgroup validity concerns. We improve sociodemographic reporting for two datasets: TMED-2 and MIMIC-IV-ECHO. Analysis of six open datasets reveals no consideration of gender-diverse patients and insufficient patient counts for many racial and ethnic groups. We further perform an exploratory subgroup analysis of two published aortic stenosis detection models on TMED-2. We find insufficient evidence for subgroup validity for sex, racial, and ethnic subgroups. Our findings highlight that more data for underrepresented subgroups, improved demographic reporting, and subgroup-focused analyses are needed to establish subgroup validity in future work.
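The abstract's "exploratory subgroup analysis" can be illustrated with a minimal sketch: compute a per-subgroup performance metric (here AUROC) with a bootstrap confidence interval, so that small subgroups surface as wide intervals rather than misleading point estimates. This is not the paper's actual analysis code; the data, subgroup labels, and parameter choices below are hypothetical placeholders.

```python
# Hedged sketch of per-subgroup model evaluation with bootstrap CIs.
# All inputs are synthetic; this only illustrates the general technique,
# not the study's actual methodology or results.
import random

def auroc(labels, scores):
    """Rank-based AUROC: probability a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # subgroup lacks both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC.

    Wide intervals for small subgroups are exactly the 'insufficient
    evidence' situation the abstract describes.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(auroc([labels[i] for i in idx], [scores[i] for i in idx]))
    stats = sorted(s for s in stats if s == s)  # drop NaN resamples
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

def evaluate_by_subgroup(labels, scores, subgroups):
    """Report AUROC and CI separately for each subgroup label."""
    results = {}
    for g in sorted(set(subgroups)):
        ys = [y for y, s in zip(labels, subgroups) if s == g]
        ss = [p for p, s in zip(scores, subgroups) if s == g]
        results[g] = (auroc(ys, ss), bootstrap_auroc_ci(ys, ss))
    return results
```

In practice one would run this per sex, race, and ethnicity annotation and compare intervals across subgroups before drawing any deployment conclusions.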