🤖 AI Summary
This paper studies fair consensus clustering: aggregating multiple input clusterings, each potentially built on different non-sensitive attributes, into a single clustering that both represents the collective structure of the data and is fair, meaning each protected group appears in every cluster in proportion to its share of the whole dataset. The authors give what is, to their knowledge, the first constant-factor approximation for this problem. A key ingredient is Closest Fair Clustering, the postprocessing task of minimally modifying an existing clustering to make it fair. For this task the paper provides an optimal algorithm when the groups are equally represented, near-linear-time constant-factor approximation algorithms for two groups of different sizes, and a proof that the problem is NP-hard for two unequal-sized groups.
📝 Abstract
Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS'17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness -- an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear-time constant-factor approximation algorithms for the more general setting of two groups of different sizes. We complement our approximation result by showing that the problem is NP-hard for two unequal-sized groups. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.
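To make the two notions in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes fairness means exact proportional representation (every cluster has the same group proportions as the whole dataset) and it assumes the pairwise disagreement distance as the measure of how far apart two clusterings are, a common choice for consensus clustering that the abstract itself does not specify. The function names `is_fair`, `disagreement_distance`, and `consensus_cost` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def is_fair(labels, groups):
    """Return True if every cluster has the same group proportions as the dataset.

    labels[i] is the cluster of point i; groups[i] is its protected group.
    Exact proportionality is an assumption made for this illustration.
    """
    n = len(labels)
    group_totals = Counter(groups)
    cluster_sizes = Counter(labels)
    pair_counts = Counter(zip(labels, groups))
    for cluster, size in cluster_sizes.items():
        for group, total in group_totals.items():
            # Compare exact fractions to avoid floating-point error.
            if Fraction(pair_counts[(cluster, group)], size) != Fraction(total, n):
                return False
    return True


def disagreement_distance(a, b):
    """Count point pairs placed together in one clustering but apart in the other.

    This distance is an assumed choice, not taken from the abstract.
    """
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )


def consensus_cost(candidate, inputs):
    """Total distance from a candidate clustering to all input clusterings."""
    return sum(disagreement_distance(candidate, x) for x in inputs)


# Four points, two protected groups of equal size.
groups = ["red", "blue", "red", "blue"]
balanced = [0, 0, 1, 1]    # each cluster holds one red and one blue point
segregated = [0, 1, 0, 1]  # each cluster holds a single group

print(is_fair(balanced, groups))
print(is_fair(segregated, groups))
print(consensus_cost(balanced, [balanced, segregated]))
```

Under these assumptions, a fair consensus clustering would be a candidate that passes `is_fair` and has the smallest `consensus_cost` against the inputs, while Closest Fair Clustering corresponds to the special case of a single input clustering.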