🤖 AI Summary
AI-based breast cancer screening models frequently exhibit degraded performance and bias across demographic subgroups, particularly those defined by age, race, and four other attributes, when deployed in real-world clinical settings.
Method: We propose the first clinically deployable dynamic fairness monitoring framework, integrating multidimensional subgroup performance evaluation, subgroup drift detection, and a real-time alerting mechanism based on adaptive thresholding and trend analysis.
Contribution/Results: Our framework uniquely bridges statistical drift detection with clinically actionable alerts, enabling continuous, automated bias identification. Evaluated on the EMBED and RSNA 2022 datasets, it successfully identifies multiple significantly underperforming subgroups and detects performance degradation with an average alerting latency of less than seven days. This provides a practical, scalable technical pathway for ensuring fairness in deployed AI systems.
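The alerting mechanism described above, an adaptive threshold combined with trend analysis over each subgroup's performance history, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class name, window size, and the rolling mean-minus-k-standard-deviations threshold are assumptions chosen for clarity.

```python
from collections import deque


class SubgroupDriftMonitor:
    """Hypothetical sketch of adaptive-threshold drift alerting.

    Tracks a rolling window of a subgroup's daily metric (e.g., AUC)
    and raises an alert when the latest value falls below an adaptive
    threshold: rolling mean minus k rolling standard deviations,
    with a minimum drop to avoid alerting on a flat history.
    """

    def __init__(self, window=30, k=2.0, min_drop=0.02, warmup=5):
        self.history = deque(maxlen=window)  # recent metric values
        self.k = k
        self.min_drop = min_drop
        self.warmup = warmup  # observations needed before alerting

    def update(self, value):
        """Feed one new metric value; return True if it triggers an alert."""
        alert = False
        if len(self.history) >= self.warmup:
            mean = sum(self.history) / len(self.history)
            var = sum((v - mean) ** 2 for v in self.history) / len(self.history)
            threshold = mean - max(self.k * var ** 0.5, self.min_drop)
            alert = value < threshold
        self.history.append(value)
        return alert
```

For example, a subgroup whose AUC holds near 0.85 for ten days and then drops to 0.70 would trigger an alert on the day of the drop, consistent with the sub-seven-day latency the framework targets.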
📝 Abstract
Automated mammography screening plays an important role in early breast cancer detection. However, current machine learning models, developed on specific training datasets, may exhibit performance degradation and bias when deployed in real-world settings. In this paper, we analyze the performance of high-performing AI models on two mammography datasets: the Emory Breast Imaging Dataset (EMBED) and the RSNA 2022 challenge dataset. Specifically, we evaluate how these models perform across subgroups defined by six attributes, using a range of classification metrics to detect potential biases. Our analysis identifies certain subgroups that demonstrate notable underperformance, highlighting the need for ongoing monitoring of their performance. To address this, we adopt a monitoring method designed to detect performance drift over time. Upon identifying a drift, the method issues an alert, enabling timely intervention. This approach not only provides a tool for tracking performance but also helps ensure that AI models continue to perform effectively across diverse populations.
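The per-subgroup evaluation the abstract describes, computing classification metrics separately for each subgroup defined by an attribute, can be sketched as below. This is a simplified illustration assuming binary labels and predictions; the function name and the choice of sensitivity/specificity as the reported metrics are assumptions, not the paper's exact protocol.

```python
def subgroup_metrics(y_true, y_pred, groups):
    """Compute sensitivity and specificity per subgroup.

    y_true, y_pred: binary labels/predictions (0 or 1).
    groups: subgroup label for each example (e.g., an age band or race).
    Returns {subgroup: {"sensitivity": ..., "specificity": ...}};
    a metric is None when its denominator is empty for that subgroup.
    """
    results = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        results[g] = {
            "sensitivity": tp / (tp + fn) if (tp + fn) else None,
            "specificity": tn / (tn + fp) if (tn + fp) else None,
        }
    return results
```

Comparing these per-subgroup numbers against the overall metrics is what surfaces the underperforming subgroups that then warrant continuous monitoring.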