🤖 AI Summary
This work proposes a scalable fair clustering algorithm based on finite mixture models, addressing the limited scalability of existing methods, whose parameter counts grow with the sample size. By decoupling the joint optimization of cluster centers and per-datum assignments, the approach introduces a parameterization whose size is independent of the dataset and combines maximum likelihood estimation with mini-batch optimization. This lets the algorithm enforce demographic parity (ensuring that the proportion of sensitive attributes within each cluster matches the global distribution) while accommodating non-metric data such as categorical features. Theoretical analysis and empirical evaluations show that the method efficiently produces approximately fair clusterings, with superior scalability on large-scale and heterogeneous datasets.
📝 Abstract
The goal of fair clustering is to find clusters such that the proportion of sensitive attributes (e.g., gender, race) in each cluster is similar to that of the entire dataset. Various fair clustering algorithms have been proposed that modify standard K-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size, because the cluster assignment of each datum must be optimized jointly with the cluster centers; scaling up such algorithms is therefore difficult. In this paper, we propose a new fair clustering algorithm based on a finite mixture model, called Fair Model-based Clustering (FMC). A main advantage of FMC is that the number of learnable parameters is independent of the sample size, so the algorithm scales easily. In particular, mini-batch learning can be used to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. Theoretical and empirical justifications for the superiority of the proposed algorithm are provided.
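The abstract's central point, that a mixture model keeps the learned parameter count fixed (only the component parameters) regardless of sample size, so parameters can be updated from mini-batches, can be illustrated with a generic sketch. This is not the paper's FMC algorithm: the isotropic-Gaussian mixture with equal weights, the EMA-smoothed M-step, and the demographic-parity check at the end are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: sensitive attribute s in {0, 1}, 2-D features with a slight group shift.
n = 1000
s = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + s[:, None] * 0.5

K = 2
mu = rng.normal(size=(K, 2))   # mixture means: the ONLY learned parameters (size independent of n)
global_prop = s.mean()         # global proportion of s=1; demographic parity targets this per cluster

def responsibilities(Xb, mu):
    # Soft cluster assignments from an isotropic-Gaussian mixture with equal weights.
    d2 = ((Xb[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    logits = -0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

# Mini-batch EM-style updates: each step touches only a small batch,
# and assignments are recomputed from mu rather than stored per datum.
for step in range(200):
    idx = rng.choice(n, size=100, replace=False)
    r = responsibilities(X[idx], mu)
    mu_batch = (r.T @ X[idx]) / (r.sum(axis=0)[:, None] + 1e-9)
    mu = 0.9 * mu + 0.1 * mu_batch   # exponential moving average of the M-step

# Demographic-parity check: proportion of s=1 in each cluster vs. the global proportion.
z = responsibilities(X, mu).argmax(axis=1)
for k in range(K):
    prop = s[z == k].mean() if (z == k).any() else float("nan")
    print(f"cluster {k}: prop(s=1) = {prop:.2f} (global {global_prop:.2f})")
```

The sketch omits any fairness constraint during fitting; it only measures parity afterwards. The paper's contribution is to make the fitted clusters approximately satisfy this parity criterion while retaining the fixed-size parameterization shown here.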