On the Role of Calibration in Benchmarking Algorithmic Fairness for Skin Cancer Detection

📅 2025-10-29
🏛️ Machine Learning for Biomedical Imaging
📈 Citations: 0
Influential: 0
🤖 AI Summary
State-of-the-art AI models for melanoma detection achieve high AUROC but exhibit significant calibration disparities across demographic subgroups (skin tone, sex, age), leading to miscalibrated clinical risk estimates and impeding equitable deployment. Method: This work introduces the first systematic integration of calibration metrics—specifically Expected Calibration Error (ECE) and reliability diagrams—into algorithmic fairness benchmarking, jointly evaluating discriminative performance (AUROC) and calibration across subgroups. We assess the top three ISIC 2020 challenge models on both the ISIC 2020 and PROVE-AI datasets, stratified by Fitzpatrick skin type, sex, and age. Results: All leading models demonstrate substantial calibration imbalance across subgroups, with pronounced overdiagnosis risk in individuals with darker skin tones and older adults. Our findings expose the critical limitation of relying solely on discriminative metrics for fairness assessment. We propose a “discrimination–calibration” dual-axis fairness evaluation paradigm, establishing a new standard for auditing trustworthy dermatologic AI and facilitating its responsible clinical translation.

📝 Abstract
Artificial Intelligence (AI) models have demonstrated expert-level performance in melanoma detection, yet their clinical adoption is hindered by performance disparities across demographic subgroups such as gender, race, and age. Previous efforts to benchmark the performance of AI models have primarily focused on group fairness metrics that rely on the Area Under the Receiver Operating Characteristic curve (AUROC), which does not capture a model’s ability to produce accurate probability estimates. In line with clinical assessments, this paper addresses this gap by incorporating calibration as a complementary benchmarking metric to AUROC-based fairness metrics. Calibration evaluates the alignment between predicted probabilities and observed event rates, offering deeper insights into subgroup biases. We assess the performance of the leading skin cancer detection algorithm of the ISIC 2020 Challenge on the ISIC 2020 Challenge dataset and the PROVE-AI dataset, and compare it with the second- and third-place models, focusing on subgroups defined by sex, race (Fitzpatrick Skin Tone), and age. Our findings reveal that while existing models enhance discriminative accuracy, they often over-diagnose risk and exhibit calibration issues when applied to new datasets. This study underscores the necessity for comprehensive model auditing strategies and extensive metadata collection to achieve equitable AI-driven healthcare solutions.
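The abstract's notion of calibration (alignment between predicted probabilities and observed event rates) is commonly quantified with the binned Expected Calibration Error named in the summary. Below is a minimal sketch of that metric; the binning scheme and variable names are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: size-weighted average of |observed rate - mean confidence|.

    probs: predicted probabilities of the positive (e.g. malignant) class.
    labels: binary ground-truth labels (1 = positive).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Half-open bins (lo, hi]; the first bin also includes probs == 0.
        mask = (probs > lo) & (probs <= hi) if i > 0 else (probs >= lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()  # mean predicted risk in the bin
        observed = labels[mask].mean()   # empirical event rate in the bin
        ece += mask.mean() * abs(observed - confidence)
    return ece

# Toy check: predicting 0.5 on a half-positive sample is perfectly calibrated.
print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.0
```

A reliability diagram plots the same per-bin (confidence, observed rate) pairs that this loop aggregates, so the two diagnostics come from one binning pass.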
Problem

Research questions and friction points this paper is trying to address.

Addressing performance disparities in skin cancer detection AI across demographic subgroups
Incorporating calibration metrics alongside AUROC for comprehensive fairness benchmarking
Identifying model over-diagnosis and calibration issues when applied to new datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses calibration as complementary metric to AUROC
Evaluates alignment of predicted probabilities with observed event rates
Assesses model performance across demographic subgroups
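The dual-axis idea above, reporting discrimination and calibration side by side for each subgroup, can be sketched on synthetic data. The subgroup attribute, the score-generating model, and the use of calibration-in-the-large (mean predicted risk minus observed event rate) as a coarse calibration gap are all illustrative assumptions here, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic audit set: a stand-in binary subgroup attribute (e.g. a binarized
# skin-tone stratum), binary labels, and scores from a model that discriminates
# in both groups but is systematically over-confident in group 1.
n = 1000
group = rng.integers(0, 2, size=n)
labels = rng.integers(0, 2, size=n)
scores = np.clip(0.3 * labels + 0.5 * rng.random(n) + 0.15 * group, 0.0, 1.0)

def auroc(s, y):
    """Rank-based AUROC (Mann-Whitney U / (n_pos * n_neg)); assumes no ties."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

for g in (0, 1):
    m = group == g
    # Calibration-in-the-large: mean predicted risk minus observed event rate.
    gap = scores[m].mean() - labels[m].mean()
    print(f"group {g}: AUROC = {auroc(scores[m], labels[m]):.3f}, risk gap = {gap:+.3f}")
```

On this toy data the two groups show similar AUROC but different risk gaps, which is exactly the failure mode an AUROC-only fairness audit would miss.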
Brandon Dominique
Northeastern University, Boston, MA, USA 02115
Prudence Lam
Northeastern University, Boston, MA, USA 02115
N. Kurtansky
Memorial Sloan Kettering Cancer Center, New York, NY, USA 10065
Jochen Weber
Memorial Sloan Kettering Cancer Center, New York, NY, USA 10065
Kivanc Kose
Memorial Sloan Kettering Cancer Center
Veronica Rotemberg
Memorial Sloan Kettering Cancer Center
Jennifer Dy
Electrical and Computer Engineering, Northeastern University