Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges

📅 2025-12-22

🤖 AI Summary

Medical AI challenge leaderboards have three recurring flaws: (1) score differences are rarely tested for statistical significance; (2) a single averaged metric is applied to every organ, which hides organ-specific errors; and (3) performance across demographic subgroups is seldom reported. RankInsight is an open-source toolkit that audits leaderboards along these three axes: (1) pairwise significance maps computed via permutation tests and bootstrap resampling; (2) leaderboard re-ranking with organ-appropriate metrics, such as Normalized Surface Distance (NSD) in place of Dice for tubular structures; and (3) performance decomposition and disparity visualization across intersecting gender-race subgroups. Applied to nnU-Net, Vision-Language, and MONAI submissions, the audit shows that the nnU-Net family outperforms the other two groups with high statistical certainty, that the order of the top four models reverses when Dice is replaced by NSD for tubular structures, and that more than half of the MONAI-based entries show the largest gender-race discrepancy. RankInsight thus moves leaderboard evaluation from purely technical ranking toward rankings that are statistically sound, clinically meaningful, and demographically fair.

📝 Abstract

Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
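The pairwise significance map described in the abstract can be illustrated with a small sketch. This is not the RankInsight implementation: it assumes each model has one score per test case, that the cases are in the same order for every model, and it uses a paired sign-flip permutation test as one plausible choice of test. The function names `paired_permutation_pvalue` and `significance_map` and the example scores are hypothetical.

```python
import random


def paired_permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided paired sign-flip permutation test on per-case scores.

    Assumption: a[i] and b[i] are scores of two models on the same case i.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis, each paired difference is equally
        # likely to carry either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def significance_map(scores, n_perm=2000, seed=0):
    """Return model names and a symmetric matrix of pairwise p-values.

    scores: dict mapping model name -> list of per-case scores.
    """
    names = list(scores)
    n = len(names)
    pmap = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = paired_permutation_pvalue(
                scores[names[i]], scores[names[j]], n_perm=n_perm, seed=seed
            )
            pmap[i][j] = pmap[j][i] = p
    return names, pmap


# Illustrative, made-up per-case scores (not results from the paper).
model_a = [0.80 + 0.001 * i for i in range(30)]
model_b = [x - 0.05 for x in model_a]
names, pmap = significance_map({"model_a": model_a, "model_b": model_b})
print(names, pmap[0][1])
```

A small p-value for a pair suggests that the score gap is unlikely to be a chance artifact of the test cases. With many model pairs, a multiple-comparison correction would normally be applied on top of this.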

Problem

Research questions and friction points this paper is trying to address.

Auditing statistical significance of medical AI leaderboard score gaps

Selecting organ-specific metrics to reveal clinically important errors

Evaluating intersectional demographic fairness across gender and race
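To make the metric-selection question concrete, the sketch below re-ranks models when a different metric is assigned per organ. It is an illustration only: the data layout, the function name `rank_models`, the choice of the aorta as a tubular structure scored with NSD, and all numbers are assumptions, not values or code from the paper.

```python
from statistics import mean


def rank_models(results, metric_for_organ, default_metric="dice"):
    """Rank models by their mean score, choosing one metric per organ.

    results: model -> organ -> {"dice": float, "nsd": float}
    metric_for_organ: organ -> metric name; organs not listed fall back
    to default_metric.
    """
    overall = {}
    for model, per_organ in results.items():
        overall[model] = mean(
            metrics[metric_for_organ.get(organ, default_metric)]
            for organ, metrics in per_organ.items()
        )
    # Higher is better for both Dice and NSD.
    return sorted(overall, key=overall.get, reverse=True)


# Made-up scores for illustration only.
results = {
    "model_a": {"liver": {"dice": 0.96, "nsd": 0.90},
                "aorta": {"dice": 0.92, "nsd": 0.70}},
    "model_b": {"liver": {"dice": 0.95, "nsd": 0.91},
                "aorta": {"dice": 0.90, "nsd": 0.85}},
}

dice_only = rank_models(results, {})
organ_aware = rank_models(results, {"aorta": "nsd"})
print(dice_only)    # ['model_a', 'model_b']
print(organ_aware)  # ['model_b', 'model_a']
```

In this toy example the two models swap places once the tubular structure is scored by NSD, which is the kind of rank reversal the abstract reports for the top four models.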

Innovation

Methods, ideas, or system contributions that make the work stand out.

Computes statistical significance maps for model comparisons

Recomputes leaderboards with organ-specific evaluation metrics

Audits intersectional fairness across demographic groups
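The intersectional audit listed above can be sketched as a per-model gap between the best and worst gender-race subgroup. This is a minimal illustration under stated assumptions: the source does not define how the discrepancy is computed, so the max-minus-min gap of subgroup means is a stand-in, and the function name `intersectional_gap`, the record fields, and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def intersectional_gap(cases, min_cases=1):
    """Compute the gap between best and worst gender-race subgroup means.

    cases: list of dicts with keys "gender", "race", and "score".
    Subgroups with fewer than min_cases cases are skipped.
    Returns (gap, best_group, worst_group, subgroup_means).
    """
    groups = defaultdict(list)
    for case in cases:
        groups[(case["gender"], case["race"])].append(case["score"])
    means = {g: mean(v) for g, v in groups.items() if len(v) >= min_cases}
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return means[best] - means[worst], best, worst, means


# Made-up per-case scores for a single model (illustration only).
cases = [
    {"gender": "F", "race": "A", "score": 0.90},
    {"gender": "F", "race": "A", "score": 0.80},
    {"gender": "M", "race": "A", "score": 0.95},
    {"gender": "F", "race": "B", "score": 0.70},
    {"gender": "M", "race": "B", "score": 0.90},
]

gap, best, worst, means = intersectional_gap(cases)
print(f"gap={gap:.2f} best={best} worst={worst}")
```

Running this for every model and comparing the gaps gives one simple way to see which entries show the largest subgroup discrepancy. Small subgroups make the means noisy, so a larger `min_cases` or a significance test on the gap would be advisable in practice.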
