🤖 AI Summary
Performance disparities in medical AI often arise from complex interactions among data representation, image acquisition conditions (such as pixel spacing), and clinical or demographic factors such as BMI and gestational age, which makes attribution difficult. This work proposes a structured analytical framework that incorporates acquisition parameters into fairness evaluation. By combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation, the approach helps separate biases that stem from multiple sources. In a case study of over 94,000 fetal ultrasound images for fetal weight estimation, higher pixel spacing was associated with performance improvements of up to 24% in selected subgroups, for both a deep learning model and the Hadlock formula. Part of this effect is explained by gestational age, while the pixel-spacing-related improvements persist across BMI strata. These findings underscore the importance of acquisition-aware and interaction-aware evaluation in medical AI fairness research.
📝 Abstract
Bias in medical AI is often framed as a problem of representation. However, in image-based tasks such as fetal ultrasound, performance disparities can arise even when representation is adequate, because predictive accuracy depends strongly on image quality. Image quality is shaped by acquisition conditions and operator expertise, as well as patient-dependent factors such as maternal body mass index (BMI), all of which may correlate with sensitive demographic features. Consequently, observed disparities may reflect the combined influence of demographic, clinical, and acquisition-related factors rather than data imbalance alone, and may obscure underlying interaction or confounding effects. We propose a structured framework to explore and detect intersectional bias, combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation. In a case study of over 94,000 ultrasound images for fetal weight estimation, we analyze bias in a state-of-the-art deep learning (DL) model and the clinical standard Hadlock, a regression formula using biometric measurements. Pixel spacing (PS), a parameter considered suboptimal in current acquisition protocols, emerged as a consistent driver of performance differences, with higher PS associated with improvements of up to 24% in selected subgroups for both models. Because PS is often adapted in cases of high BMI or low gestational age (GA), this effect carries a substantial risk of confounding. Our intersectional analysis revealed that part of the PS-associated signal is explained by GA, while PS-related improvements persist across BMI strata, highlighting the importance of acquisition-aware and interaction-aware evaluation in medical AI fairness research.
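To make the confounding argument concrete, the following is a minimal sketch of a stratified comparison, not the authors' implementation. It contrasts the overall error gap between low-PS and high-PS images with the gap inside each GA stratum. The record fields (`ps`, `ga`, `ape`), the function names, and the toy values are all hypothetical and chosen only for illustration; they are not data or code from the study.

```python
from collections import defaultdict
from statistics import mean


def mean_error_by(records, keys):
    """Mean absolute percentage error ("ape") grouped by the given record keys."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in keys)].append(r["ape"])
    return {g: mean(v) for g, v in groups.items()}


def ps_gap(records, stratum_key=None):
    """Error at low PS minus error at high PS, overall or within each stratum."""
    keys = ["ps"] if stratum_key is None else [stratum_key, "ps"]
    err = mean_error_by(records, keys)
    if stratum_key is None:
        return err[("low",)] - err[("high",)]
    strata = sorted({g[0] for g in err})
    return {s: err[(s, "low")] - err[(s, "high")] for s in strata}


# Hand-made toy records (assumption: NOT data from the study).
# Low PS is over-represented in the "early" GA stratum, which also has
# higher error, so GA inflates the overall PS gap.
toy = (
    [{"ps": "low", "ga": "early", "ape": 10.0}] * 3
    + [{"ps": "high", "ga": "early", "ape": 9.0}] * 1
    + [{"ps": "low", "ga": "late", "ape": 6.0}] * 1
    + [{"ps": "high", "ga": "late", "ape": 5.0}] * 3
)

overall_gap = ps_gap(toy)            # gap ignoring GA
gap_within_ga = ps_gap(toy, "ga")    # gap inside each GA stratum

print("overall PS gap:", overall_gap)
print("PS gap within GA strata:", gap_within_ga)
```

In this toy setup the overall gap is larger than the gap inside either GA stratum, which is the pattern one would expect when part of a PS-associated signal is explained by GA while a smaller PS effect remains. The same comparison can be repeated with a BMI column in place of `ga`.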