🤖 AI Summary
This paper challenges the widely held assumption in machine learning that the Area Under the Precision-Recall Curve (AUPRC) is universally superior to the Area Under the ROC Curve (AUROC) under class imbalance, and investigates the implications for algorithmic fairness. Method: We combine theoretical analysis, experiments on semi-synthetic and real-world fairness datasets, and a large-scale literature review covering over 1.5 million scientific papers. Contribution/Results: We show theoretically that AUPRC is not generally superior under class imbalance, and that it can amplify group-level disparities by unduly favoring model improvements in subpopulations with more frequent positive labels. We trace the long-standing claim that “AUPRC is inherently better” and find that it is often made without citation, misattributed to papers that do not argue it, or over-generalized from the source arguments. Furthermore, we delineate when AUROC versus AUPRC is the appropriate choice and offer guidance for imbalance- and subgroup-aware metric selection, providing both theoretical foundations and practical guidance for fair model evaluation.
📝 Abstract
In machine learning (ML), a widespread claim is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic curve (AUROC) for tasks with class imbalance. This paper refutes this notion on two fronts. First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes, establishing clearly that AUPRC is not generally superior in cases of class imbalance. We further show that AUPRC can be a harmful metric as it can unduly favor model improvements in subpopulations with more frequent positive labels, heightening algorithmic disparities. Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. Prompted by these insights, we conduct a review of over 1.5 million scientific papers to understand the origin of this invalid claim, finding that it is often made without citation, misattributed to papers that do not argue this point, and aggressively over-generalized from source arguments. Our findings represent a dual contribution: a significant technical advancement in understanding the relationship between AUROC and AUPRC and a stark warning about unchecked assumptions in the ML community.
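The abstract does not spell out how the two metrics respond to individual model mistakes, so the following is only a toy sketch, not the paper's analysis. It assumes a ranking with no tied scores, uses average precision as a stand-in estimator for AUPRC, and treats a "mistake" as a negative ranked directly above a positive. The helper names (`auroc`, `average_precision`, `fix_mistake`) and the example ranking are hypothetical and chosen purely for illustration.

```python
def auroc(ranked_labels):
    """AUROC for binary labels listed from highest to lowest score (no ties).

    Equals the fraction of (positive, negative) pairs in which the
    positive is ranked above the negative.
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    correct_pairs = 0
    negs_below = n_neg  # negatives ranked below the current position
    for y in ranked_labels:
        if y == 1:
            correct_pairs += negs_below
        else:
            negs_below -= 1
    return correct_pairs / (n_pos * n_neg)


def average_precision(ranked_labels):
    """Average precision (assumed here as the AUPRC estimator).

    Mean, over positives, of the precision at each positive's rank.
    """
    n_pos = sum(ranked_labels)
    hits = 0
    total = 0.0
    for rank, y in enumerate(ranked_labels, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def fix_mistake(ranked_labels, i):
    """Return a copy where the negative at index i and the positive
    directly below it have swapped places (one corrected mistake)."""
    assert ranked_labels[i] == 0 and ranked_labels[i + 1] == 1
    fixed = list(ranked_labels)
    fixed[i], fixed[i + 1] = 1, 0
    return fixed


# Hypothetical ranking, highest score first: 2 positives, 8 negatives,
# with one mistake near the top (index 0) and one near the bottom (index 6).
base = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
top_fixed = fix_mistake(base, 0)
bottom_fixed = fix_mistake(base, 6)

for name, ranking in [("top", top_fixed), ("bottom", bottom_fixed)]:
    d_auroc = auroc(ranking) - auroc(base)
    d_ap = average_precision(ranking) - average_precision(base)
    print(f"fix {name} mistake: AUROC gain={d_auroc:.4f}, AP gain={d_ap:.4f}")
```

In this toy ranking, both corrections raise AUROC by the same amount, while average precision rewards the correction near the top of the ranking far more than the one near the bottom. Whether this asymmetry translates into disparities between subpopulations depends on how their scores and label frequencies are distributed, which is the question the paper's theory and experiments address.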