🤖 AI Summary
This study addresses the limited generalizability and insufficient fairness of existing AI models for breast MRI, which are predominantly trained on single-center data and evaluated with aggregate metrics. To overcome these limitations, we established an intercontinental, multicenter benchmark dataset for breast MRI and, within an international challenge (MAMA-MIA), jointly evaluated the generalization performance and fairness of models on two tasks: primary tumor segmentation and prediction of pathologic complete response to neoadjuvant chemotherapy from pre-treatment MRI. The evaluation relied on standardized preprocessing, independent external testing (training on 1,506 patients from United States institutions, testing on 574 patients from three European centers), and a unified scoring framework that combines overall performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams took part in the final evaluation phase. The results show substantial performance variability on the external test set and a trade-off between overall accuracy and subgroup fairness, providing a benchmark and empirical foundation for developing robust and equitable AI systems in breast cancer imaging.
📝 Abstract
Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
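The abstract describes the unified scoring framework only at a high level, so the following is a minimal sketch of how overall performance and subgroup consistency could be combined, not the challenge's actual formula. Every name here (`subgroup_means`, `consistency`, `unified_score`), the max-min gap used as the consistency measure, and the equal weighting are assumptions made for illustration; the per-patient scores stand in for any metric bounded in [0, 1].

```python
from collections import defaultdict
from statistics import mean


def subgroup_means(scores, groups):
    """Mean per-patient score within each subgroup label."""
    buckets = defaultdict(list)
    for score, group in zip(scores, groups):
        buckets[group].append(score)
    return {label: mean(values) for label, values in buckets.items()}


def consistency(scores, groups):
    """1 minus the gap between best and worst subgroup mean (1.0 = no gap).

    Assumption: the challenge may define subgroup consistency differently.
    """
    means = list(subgroup_means(scores, groups).values())
    return 1.0 - (max(means) - min(means))


def unified_score(scores, attributes, weight=0.5):
    """Weighted blend of overall performance and average subgroup consistency.

    `attributes` maps an attribute name (e.g. "age") to one subgroup label
    per patient. The 0.5 weight is a hypothetical choice.
    """
    performance = mean(scores)
    fairness = mean(consistency(scores, labels) for labels in attributes.values())
    return weight * performance + (1.0 - weight) * fairness


if __name__ == "__main__":
    # Toy data: four patients, two stratification attributes.
    scores = [0.9, 0.8, 0.6, 0.7]
    attributes = {
        "age": ["<40", "<40", ">=40", ">=40"],
        "menopausal_status": ["pre", "post", "pre", "post"],
    }
    print(round(unified_score(scores, attributes), 3))
```

In this toy setup a model with a high mean score still loses points when one age group scores much lower than another, which is the kind of accuracy versus fairness trade-off the abstract reports.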