MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization

📅 2025-11-04
🤖 AI Summary
AI models for mammography generalize poorly and inherit bias because public datasets differ substantially in image quality, metadata standards, and demographic representation, which hinders clinical deployment. To address this, we propose an open-source framework for standardization and bias quantification designed specifically for mammographic data; it supports systematic identification and mitigation of biases related to breast density, lesion distribution, and other domain-specific factors. The framework harmonizes heterogeneous datasets, including CBIS-DDSM, TOMPEI-CMMD, and VinDr-Mammo, through standardized case selection, left-right view correction, intensity normalization, and unified metadata annotation. Experiments indicate that the standardized data reduce distributional shift: models trained on cleaned data achieve an average 3.2% AUC improvement on cross-domain test sets and a 41% reduction in bias metrics, improving both generalizability and fairness.

📝 Abstract
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise model generalizability, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection and image processing (including laterality and intensity correction) and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available at https://github.com/Minds-R-Lab/MammoClean.
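To make the image-processing steps named in the abstract more concrete, the following is a minimal sketch of intensity normalization and orientation correction. It is not MammoClean's actual code: the function names, the percentile thresholds, and the heuristic of comparing summed intensity in each image half are all assumptions chosen for illustration, and "laterality correction" is read here as flipping images so the breast always sits on the same side.

```python
import numpy as np

def normalize_intensity(image, low_pct=1.0, high_pct=99.0):
    """Clip to a percentile window and rescale to [0, 1].

    The 1st/99th percentile window is an illustrative assumption.
    """
    img = np.asarray(image, dtype=np.float64)
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:
        # Constant image: nothing to rescale.
        return np.zeros_like(img)
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

def breast_side(image):
    """Guess which half contains the breast by comparing summed intensity.

    Hypothetical heuristic: assumes a dark background and bright tissue.
    """
    img = np.asarray(image, dtype=np.float64)
    mid = img.shape[1] // 2
    left = img[:, :mid].sum()
    right = img[:, img.shape[1] - mid:].sum()
    return "left" if left >= right else "right"

def orient_breast_left(image):
    """Flip horizontally when needed so the tissue sits on the left edge."""
    img = np.asarray(image)
    return img if breast_side(img) == "left" else np.fliplr(img)

# Toy 4x6 "image" with all bright pixels on the right side.
toy = np.zeros((4, 6))
toy[:, 4:] = 100.0
oriented = orient_breast_left(toy)
normalized = normalize_intensity(oriented)
```

A real pipeline would probably cross-check the orientation guess against the laterality recorded in the metadata rather than trust pixel sums alone, since markers, burned-in text, or inverted photometric interpretation can mislead this kind of heuristic.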
Problem

Research questions and friction points this paper is trying to address.

Addressing data heterogeneity and bias in mammography AI datasets
Standardizing mammography data processing and metadata across diverse sources
Improving AI model generalization through bias quantification and mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardizes case selection and image processing
Unifies metadata into consistent multi-view structure
Quantifies and mitigates dataset-specific bias sources
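One way to put a number on the kind of dataset-specific shift described above is to compare the distribution of breast density categories between two datasets. The sketch below uses the Jensen-Shannon divergence (base 2, so it ranges from 0 to 1). The choice of metric, the function names, and the toy labels are assumptions for illustration, not the measure the paper reports, and the code assumes density labels have already been mapped to a shared A-D scheme.

```python
from collections import Counter
from math import log2

# Assumed shared label scheme after metadata unification.
DENSITY_CATEGORIES = ("A", "B", "C", "D")

def category_distribution(labels, categories=DENSITY_CATEGORIES):
    """Return the relative frequency of each category, in a fixed order."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels fall in the known categories")
    return [counts[c] / total for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up density labels for two hypothetical datasets.
dataset_one = ["A", "A", "B", "D"]
dataset_two = ["C", "C", "D", "D"]

p = category_distribution(dataset_one)
q = category_distribution(dataset_two)
shift = js_divergence(p, q)
```

A value near 0 means the two datasets have similar density mixes, while a value near 1 means they barely overlap. The same two functions would apply to abnormality prevalence by passing a different `categories` tuple.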
Yalda Zafari
Department of Mathematics and Statistics, Qatar University, Doha, Qatar
Hongyi Pan
Northwestern University
Signal Processing, Machine Learning, Image Processing, Federated Learning
Gorkem Durak
Northwestern University, Department of Radiology
radiology, artificial intelligence
Ulas Bagci
Northwestern University
artificial intelligence, deep learning, biomedical image analysis, medical image computing
Essam A. Rashed
Graduate School of Information Science, University of Hyogo, Kobe 650-0047, Japan
M. Mabrok
Department of Mathematics and Statistics, Qatar University, Doha, Qatar