On the Role of Calibration in Benchmarking Algorithmic Fairness for Skin Cancer Detection

📅 2025-10-29
🏛️ Machine Learning for Biomedical Imaging
📈 Citations: 0
Influential: 0
🤖 AI Summary
State-of-the-art AI models for melanoma detection achieve high AUROC but exhibit significant calibration disparities across demographic subgroups (skin tone, sex, age), leading to miscalibrated clinical risk estimates and impeding equitable deployment. Method: This work introduces the first systematic integration of calibration metrics—specifically Expected Calibration Error (ECE) and reliability diagrams—into algorithmic fairness benchmarking, jointly evaluating discriminative performance (AUROC) and calibration across subgroups. We assess the top three ISIC 2020 challenge models on both the ISIC 2020 and PROVE-AI datasets, stratified by Fitzpatrick skin type, sex, and age. Results: All leading models demonstrate substantial calibration imbalance across subgroups, with pronounced overdiagnosis risk in individuals with darker skin tones and older adults. Our findings expose the critical limitation of relying solely on discriminative metrics for fairness assessment. We propose a “discrimination–calibration” dual-axis fairness evaluation paradigm, establishing a new standard for auditing trustworthy dermatologic AI and facilitating its responsible clinical translation.

📝 Abstract
Artificial Intelligence (AI) models have demonstrated expert-level performance in melanoma detection, yet their clinical adoption is hindered by performance disparities across demographic subgroups such as gender, race, and age. Previous efforts to benchmark the performance of AI models have primarily focused on group fairness metrics that rely on the Area Under the Receiver Operating Characteristic curve (AUROC), which does not capture a model’s ability to produce accurate probability estimates. In line with clinical assessments, this paper addresses this gap by incorporating calibration as a complementary benchmarking metric to AUROC-based fairness metrics. Calibration evaluates the alignment between predicted probabilities and observed event rates, offering deeper insights into subgroup biases. We assess the performance of the leading skin cancer detection algorithm of the ISIC 2020 Challenge on the ISIC 2020 Challenge dataset and the PROVE-AI dataset, and compare it with the second- and third-place models, focusing on subgroups defined by sex, race (Fitzpatrick Skin Tone), and age. Our findings reveal that while existing models enhance discriminative accuracy, they often over-diagnose risk and exhibit calibration issues when applied to new datasets. This study underscores the necessity for comprehensive model auditing strategies and extensive metadata collection to achieve equitable AI-driven healthcare solutions.
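The abstract's notion of calibration (alignment between predicted probabilities and observed event rates) is commonly quantified with the binned Expected Calibration Error named in the summary. Below is a minimal sketch of that metric; the binning scheme and variable names are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: size-weighted average of |observed rate - mean confidence|.

    probs: predicted probabilities of the positive (e.g. malignant) class.
    labels: binary ground-truth labels (1 = positive).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Half-open bins (lo, hi]; the first bin also includes probs == 0.
        mask = (probs > lo) & (probs <= hi) if i > 0 else (probs >= lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()  # mean predicted risk in the bin
        observed = labels[mask].mean()   # empirical event rate in the bin
        ece += mask.mean() * abs(observed - confidence)
    return ece

# Toy check: predicting 0.5 on a half-positive sample is perfectly calibrated.
print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.0
```

A reliability diagram plots the same per-bin (confidence, observed rate) pairs that this loop aggregates, so the two diagnostics come from one binning pass.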
Problem

Research questions and friction points this paper is trying to address.

Addressing performance disparities in skin cancer detection AI across demographic subgroups
Incorporating calibration metrics alongside AUROC for comprehensive fairness benchmarking
Identifying model over-diagnosis and calibration issues when applied to new datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses calibration as complementary metric to AUROC
Evaluates alignment of predicted probabilities with observed event rates
Assesses model performance across demographic subgroups
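The dual-axis idea above, reporting discrimination and calibration side by side for each subgroup, can be sketched on synthetic data. The subgroup attribute, the score-generating model, and the use of calibration-in-the-large (mean predicted risk minus observed event rate) as a coarse calibration gap are all illustrative assumptions here, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic audit set: a stand-in binary subgroup attribute (e.g. a binarized
# skin-tone stratum), binary labels, and scores from a model that discriminates
# in both groups but is systematically over-confident in group 1.
n = 1000
group = rng.integers(0, 2, size=n)
labels = rng.integers(0, 2, size=n)
scores = np.clip(0.3 * labels + 0.5 * rng.random(n) + 0.15 * group, 0.0, 1.0)

def auroc(s, y):
    """Rank-based AUROC (Mann-Whitney U / (n_pos * n_neg)); assumes no ties."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

for g in (0, 1):
    m = group == g
    # Calibration-in-the-large: mean predicted risk minus observed event rate.
    gap = scores[m].mean() - labels[m].mean()
    print(f"group {g}: AUROC = {auroc(scores[m], labels[m]):.3f}, risk gap = {gap:+.3f}")
```

On this toy data the two groups show similar AUROC but different risk gaps, which is exactly the failure mode an AUROC-only fairness audit would miss.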
Brandon Dominique
Northeastern University, Boston, MA, USA 02115
Prudence Lam
Northeastern University, Boston, MA, USA 02115
N. Kurtansky
Memorial Sloan Kettering Cancer Center, New York, NY, USA 10065
Jochen Weber
Memorial Sloan Kettering Cancer Center, New York, NY, USA 10065
Kivanc Kose
Memorial Sloan Kettering Cancer Center
Veronica Rotemberg
Memorial Sloan Kettering Cancer Center
Jennifer Dy
Electrical and Computer Engineering, Northeastern University