🤖 AI Summary
This paper studies bipartite ranking under multiple annotators: given inconsistent binary labels from diverse annotators, the task is to synthesize a single ranking that maximizes the Area Under the ROC Curve (AUC). We formally analyze, through the lenses of Bayes optimality and Pareto optimality, the two dominant aggregation paradigms: loss aggregation (aggregating annotator-specific losses) and label aggregation (aggregating raw labels prior to model training). We prove that both achieve Pareto optimality in expectation; however, loss aggregation suffers from "label dictatorship," in which a single noisy annotator can dominate the ranking objective, undermining robustness. In contrast, label aggregation is substantially more robust to annotation noise. Empirical evaluation on real-world multi-annotator datasets demonstrates that label aggregation improves both AUC stability and absolute performance over loss aggregation. Our core contribution is the theoretical and empirical identification of the aggregation mechanism as a fundamental determinant of ranking robustness, establishing label aggregation as the theoretically grounded and empirically superior approach.
📝 Abstract
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem -- loss aggregation and label aggregation -- by characterizing their Bayes-optimal solutions. Based on this, we show that while both methods can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
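The two aggregation paradigms contrasted above can be sketched on toy data. This is a minimal illustration, not the paper's method: it assumes majority vote as the label-aggregation rule and an average of per-annotator AUCs as the loss-aggregation objective; the `auc` helper and the toy scores/labels are invented for the example.

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: fraction of (positive, negative) pairs the scores rank
    correctly, counting ties as half. Illustrative helper, not from the paper."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-vs-negative pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Toy setup: one ranker's scores over 6 instances, labeled by 3 annotators
# (the third is noisier than the other two).
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([
    [1, 1, 1, 0, 0, 0],   # annotator 1
    [1, 1, 0, 1, 0, 0],   # annotator 2
    [1, 0, 1, 0, 1, 0],   # annotator 3 (noisy)
])

# Loss aggregation: evaluate the ranking against each annotator's labels
# separately, then average the per-annotator AUC objectives.
loss_agg = np.mean([auc(scores, y) for y in labels])

# Label aggregation: merge the labels first (majority vote), then score
# the ranking against the single aggregated label.
majority = (labels.mean(axis=0) > 0.5).astype(int)
label_agg = auc(scores, majority)
```

In this toy instance, majority voting cancels annotator 3's noise before evaluation (`label_agg` is 1.0), while the averaged objective still lets that annotator's noisy labels pull down the criterion, hinting at the robustness gap the paper formalizes.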