Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges

📅 2025-12-22
🤖 AI Summary
Medical AI challenge leaderboards suffer from three critical flaws: (1) score differences are rarely tested for statistical significance; (2) a single averaged metric obscures organ-specific errors; and (3) performance across demographic subgroups is seldom reported. To address these issues, we introduce RankInsight, an open-source toolkit built around a three-part auditing framework: (1) pairwise significance mapping via permutation tests and bootstrap resampling; (2) leaderboard re-ranking with organ-appropriate metrics (e.g., NSD in place of Dice for tubular structures); and (3) fine-grained performance decomposition and disparity visualization across intersecting gender–race subgroups. Evaluating nnU-Net, vision-language, and MONAI submissions, our framework shows that the nnU-Net family outperforms the other two with high statistical certainty, that the order of the top four models reverses under NSD-based re-ranking, and that more than half of the MONAI-based entries show the largest gender–race discrepancy on a proprietary Johns Hopkins Hospital dataset. RankInsight thus shifts leaderboard evaluation from purely technical ranking toward rankings that are statistically sound, clinically meaningful, and demographically fair.
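To make the pairwise significance idea concrete, here is a minimal sketch of a paired sign-flip permutation test applied to every pair of models. It is not RankInsight's code: the function names, the input layout (a dict of per-case scores per model), and the defaults are assumptions made for illustration, and no multiple-comparison correction is applied.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on per-case score differences.

    Hypothetical helper: both lists must hold scores for the same cases,
    in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the two model labels are exchangeable
        # for each case, so every difference flips sign with probability 0.5.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / n) >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


def significance_map(per_case_scores, alpha=0.05, **kwargs):
    """Return {(model_i, model_j): (p_value, is_significant)} for all pairs."""
    names = sorted(per_case_scores)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            p = paired_permutation_pvalue(
                per_case_scores[first], per_case_scores[second], **kwargs
            )
            result[(first, second)] = (p, p < alpha)
    return result
```

Calling `significance_map({"model_a": [...], "model_b": [...]})` with per-case Dice or NSD values yields one p-value per model pair. With many models, a correction such as Holm or Bonferroni would normally be layered on top before declaring any gap significant.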

📝 Abstract
Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
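The metric-choice point in the abstract can be illustrated with a small sketch that recomputes a leaderboard after swapping the metric for selected organs. Everything here is hypothetical: the `rerank` function, the nested score layout, the choice of the aorta as the tubular structure, and the numbers, which are made up to show how a reversal can arise and are not results from the paper.

```python
def rerank(scores, metric_for_organ, default_metric="dice"):
    """Rank models by the mean of one chosen metric per organ.

    scores[model][organ][metric] is a float where higher is better.
    metric_for_organ maps an organ name to the metric used for it;
    organs not listed fall back to default_metric.
    """
    leaderboard = {}
    for model, organs in scores.items():
        chosen = [
            per_metric[metric_for_organ.get(organ, default_metric)]
            for organ, per_metric in organs.items()
        ]
        leaderboard[model] = sum(chosen) / len(chosen)
    # Best model first.
    return sorted(leaderboard, key=leaderboard.get, reverse=True)


# Made-up scores for two placeholder models (illustration only).
toy_scores = {
    "model_x": {
        "liver": {"dice": 0.96, "nsd": 0.90},
        "aorta": {"dice": 0.92, "nsd": 0.70},
    },
    "model_y": {
        "liver": {"dice": 0.95, "nsd": 0.91},
        "aorta": {"dice": 0.90, "nsd": 0.85},
    },
}

dice_only = rerank(toy_scores, metric_for_organ={})
organ_aware = rerank(toy_scores, metric_for_organ={"aorta": "nsd"})
print(dice_only)    # ['model_x', 'model_y']
print(organ_aware)  # ['model_y', 'model_x']
```

In this toy case `model_x` leads on Dice alone, but its weak boundary agreement on the tubular organ drops it below `model_y` once NSD is used there, which is the kind of reversal the abstract describes for the top four models.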
Problem

Research questions and friction points this paper is trying to address.

Auditing statistical significance of medical AI leaderboard score gaps
Selecting organ-specific metrics to reveal clinically important errors
Evaluating intersectional demographic fairness across gender and race
Innovation

Methods, ideas, or system contributions that make the work stand out.

Computes statistical significance maps for model comparisons
Recomputes leaderboards with organ-specific evaluation metrics
Audits intersectional fairness across demographic groups
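As a rough illustration of the intersectional audit, the sketch below groups per-case scores by (gender, race) and reports the gap between the best and worst subgroup mean for one model. The record fields, the function names, and the placeholder subgroup labels are assumptions for this example, not the toolkit's actual interface, and the scores are invented.

```python
from collections import defaultdict


def subgroup_means(records):
    """Mean score per intersectional subgroup, keyed by (gender, race)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["gender"], rec["race"])].append(rec["score"])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


def largest_gap(records):
    """Return (gap, best_group, worst_group) across subgroup means."""
    means = subgroup_means(records)
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return means[best] - means[worst], best, worst


# Invented per-case scores for one placeholder model (illustration only).
toy_records = [
    {"gender": "F", "race": "group_1", "score": 0.9},
    {"gender": "F", "race": "group_1", "score": 0.8},
    {"gender": "M", "race": "group_1", "score": 0.7},
    {"gender": "F", "race": "group_2", "score": 0.6},
    {"gender": "F", "race": "group_2", "score": 0.7},
    {"gender": "M", "race": "group_2", "score": 0.8},
]

gap, best, worst = largest_gap(toy_records)
print(best, worst, round(gap, 3))
```

A real audit would also need a minimum subgroup size and a significance test on the gap, since small subgroups produce noisy means.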
Authors

Ariel Lubonja
Johns Hopkins University
Pedro R. A. S. Bassi
Johns Hopkins University
Wenxuan Li
Johns Hopkins University
Imaging Informatics, Computer-aided Diagnosis
Hualin Qiao
Johns Hopkins University
Randal Burns
Professor of Computer Science, Johns Hopkins University
Storage, High-Performance Computing, Scientific Databases
Alan L. Yuille
Johns Hopkins University
Zongwei Zhou
Johns Hopkins University