A Closer Look at AUROC and AUPRC under Class Imbalance

📅 2024-01-11
🏛️ arXiv.org
📈 Citations: 15
Influential: 0
🤖 AI Summary
This paper challenges the widely held assumption in machine learning that the area under the precision-recall curve (AUPRC) is universally superior to the area under the ROC curve (AUROC) under class imbalance, and investigates the implications for algorithmic fairness. Method: rigorous theoretical analysis, experiments on semi-synthetic and real-world fairness-sensitive datasets, and a large-scale bibliometric study covering over 1.5 million publications. Contribution/Results: the paper establishes theoretically that AUPRC is not generally advantageous under class imbalance; instead, it can amplify group-level bias by favoring subpopulations with more frequent positive labels. The authors trace and empirically refute the long-standing misconception that "AUPRC is inherently better" under imbalance, establish verifiable criteria delineating when each metric applies, and propose a principled, imbalance- and subgroup-aware framework for metric selection, offering both theoretical foundations and practical guidance for fair model evaluation.

📝 Abstract
In machine learning (ML), a widespread claim is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for tasks with class imbalance. This paper refutes this notion on two fronts. First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes, establishing clearly that AUPRC is not generally superior in cases of class imbalance. We further show that AUPRC can be a harmful metric as it can unduly favor model improvements in subpopulations with more frequent positive labels, heightening algorithmic disparities. Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. Prompted by these insights, we conduct a review of over 1.5 million scientific papers to understand the origin of this invalid claim, finding that it is often made without citation, misattributed to papers that do not argue this point, and aggressively over-generalized from source arguments. Our findings represent a dual contribution: a significant technical advancement in understanding the relationship between AUROC and AUPRC and a stark warning about unchecked assumptions in the ML community.
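The abstract's core claim can be illustrated with a small sketch (not from the paper's code; the example data are invented): AUROC is the probability that a random positive outranks a random negative, so it is invariant to positive-class prevalence, whereas average precision (a common AUPRC estimator) depends directly on how many negatives compete with each positive. Holding the score distributions fixed and only changing prevalence leaves AUROC unchanged but lowers AUPRC:

```python
# Hypothetical illustration: AUROC is prevalence-invariant, AUPRC is not.
# Scores and prevalences below are invented for demonstration.

def auroc(pos_scores, neg_scores):
    """P(random positive outranks random negative); ties count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def average_precision(pos_scores, neg_scores):
    """Mean of precision evaluated at the rank of each positive."""
    ranked = sorted([(s, 1) for s in pos_scores] +
                    [(s, 0) for s in neg_scores], key=lambda x: -x[0])
    tp, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            tp += 1
            ap += tp / rank
    return ap / len(pos_scores)

pos = [0.9, 0.8, 0.4]                  # model scores on positives
neg_balanced = [0.7, 0.3, 0.2]         # 50% positive prevalence
neg_imbalanced = [0.7, 0.3, 0.2] * 10  # same negative score distribution,
                                       # but positives are now ~9% of the data

# AUROC is identical in both settings (8/9), while average precision
# drops under imbalance because each positive competes with more negatives.
print(auroc(pos, neg_balanced), auroc(pos, neg_imbalanced))
print(average_precision(pos, neg_balanced),
      average_precision(pos, neg_imbalanced))
```

This prevalence sensitivity is what drives the paper's fairness argument: if subgroups differ in positive-label frequency, a pooled AUPRC rewards improvements in the higher-prevalence subgroup more than identical ranking improvements elsewhere.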
Problem

Research questions and friction points this paper is trying to address.

Imbalanced Classes
AUPRC vs AUROC
Algorithmic Fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

AUPRC Misinterpretation
Algorithmic Fairness
AUROC Reevaluation
Matthew B. A. McDermott
Assistant Professor, Columbia University Department of Biomedical Informatics
Machine Learning · Biomedical Informatics
Lasse Hyldig Hansen
Aarhus University
Haoran Zhang
Massachusetts Institute of Technology
Giovanni Angelotti
IRCCS Humanitas Research Hospital
J. Gallifant
Massachusetts Institute of Technology