Fair Model-based Clustering

📅 2026-02-24
🤖 AI Summary
This work proposes a scalable fair clustering algorithm based on finite mixture models. It addresses the limited scalability of existing methods, whose number of learnable parameters grows with the sample size because each datum's cluster assignment is optimized jointly with the cluster centers. By replacing per-datum assignments with mixture-model parameters, the approach keeps the parameter count independent of dataset size and combines maximum likelihood estimation with mini-batch optimization. The algorithm targets demographic parity, meaning that the proportion of each sensitive attribute within a cluster is similar to its proportion in the entire dataset, and it accommodates non-metric data such as categorical features. Theoretical analysis and empirical evaluations are reported in support of approximately fair clustering on large-scale and heterogeneous datasets.
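The demographic parity criterion described in this summary can be made concrete with a small check. The sketch below is illustrative only and is not taken from the paper: the function name `parity_gap` and the toy labels are assumptions. It reports the largest absolute difference between a group's share inside any cluster and that group's share in the whole dataset, so a value of 0 means every cluster mirrors the global proportions exactly.

```python
from collections import Counter


def parity_gap(clusters, groups):
    """Largest |share of a group in a cluster - share of that group overall|.

    `clusters` and `groups` are equal-length sequences holding, for each
    datum, its hard cluster label and its sensitive-attribute value.
    (Hypothetical helper, not part of FMC.)
    """
    n = len(groups)
    overall = {g: c / n for g, c in Counter(groups).items()}
    gap = 0.0
    for k in set(clusters):
        members = [g for c, g in zip(clusters, groups) if c == k]
        counts = Counter(members)
        for g, share in overall.items():
            gap = max(gap, abs(counts[g] / len(members) - share))
    return gap


clusters = [0, 0, 0, 0, 1, 1, 1, 1]
# Each cluster holds two "a" and two "b": matches the 50/50 global split.
fair = parity_gap(clusters, ["a", "b", "a", "b", "a", "a", "b", "b"])
# Cluster 0 is 75% "a", cluster 1 is 25% "a": both deviate from 50%.
unfair = parity_gap(clusters, ["a", "a", "a", "b", "b", "b", "b", "a"])
```

A metric of this kind only evaluates a finished clustering; how FMC enforces the constraint during training is described in the paper itself.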

📝 Abstract
The goal of fair clustering is to find clusters such that the proportion of each sensitive attribute (e.g., gender or race) in each cluster is similar to that in the entire dataset. Various fair clustering algorithms have been proposed that modify standard K-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size, because the cluster assignment of each datum must be optimized simultaneously with the cluster centers; scaling up these algorithms is therefore difficult. In this paper, we propose a new fair clustering algorithm based on a finite mixture model, called Fair Model-based Clustering (FMC). A main advantage of FMC is that the number of learnable parameters is independent of the sample size, so the algorithm can be scaled up easily. In particular, mini-batch learning can be used to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. Theoretical and empirical justifications for the superiority of the proposed algorithm are provided.
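To illustrate why a finite mixture model keeps the parameter count independent of the sample size, here is a minimal sketch under stated assumptions. It uses a one-dimensional Gaussian mixture with hypothetical, hand-picked parameters; the abstract does not specify FMC's likelihood or fairness penalty, so `soft_parity_gap` is an assumed stand-in and not the authors' objective. Cluster membership is never stored per datum: it is computed on demand from the mixture parameters, which is what makes evaluation on a mini-batch possible. The gap compares each group's average soft assignment with the batch-wide average; by Bayes' rule, these agree exactly when each cluster's group proportions match the overall proportions.

```python
import math


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for one 1-D point."""
    dens = [
        w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]


def soft_parity_gap(batch, groups, weights, means, variances):
    """Largest gap between a group's mean soft assignment and the batch mean.

    Assumed illustration of an approximate fairness measure on a mini-batch.
    """
    resp = [responsibilities(x, weights, means, variances) for x in batch]
    k = len(weights)
    overall = [sum(r[j] for r in resp) / len(resp) for j in range(k)]
    gap = 0.0
    for g in set(groups):
        rows = [r for r, s in zip(resp, groups) if s == g]
        for j in range(k):
            share = sum(r[j] for r in rows) / len(rows)
            gap = max(gap, abs(share - overall[j]))
    return gap


# Hypothetical parameters for K = 2 components.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
# 3K = 6 learnable numbers, whatever the size of the dataset or batch.
n_params = len(weights) + len(means) + len(variances)

batch = [-2.1, -1.9, 2.0, 2.2]
# Each group has one point near each component: small gap.
balanced = soft_parity_gap(batch, ["a", "b", "a", "b"], weights, means, variances)
# Each group sits entirely near one component: large gap.
skewed = soft_parity_gap(batch, ["a", "a", "b", "b"], weights, means, variances)
```

In a training loop, a quantity like this could be evaluated on each mini-batch alongside the log-likelihood, but how the two are combined is a design choice the abstract leaves to the full paper.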

Problem

Research questions and friction points this paper is trying to address.

fair clustering
scalability
parameter efficiency
large-scale data

Innovation

Methods, ideas, or system contributions that make the work stand out.

fair clustering
model-based clustering
finite mixture model
scalability
non-metric data
Jinwon Park
Graduate School of Data Science, Seoul National University
Kunwoong Kim
Seoul National University
Jihu Lee
Department of Statistics, Seoul National University
Yongdai Kim
Seoul National University
statistics, machine learning