๐ค AI Summary
This paper studies the fairness-constrained diversity maximization problem: selecting $k$ representative points from a multi-label dataset such that at most a constant number of samples is chosen from each label class, while maximizing the minimum pairwise distance. For small-scale cross-group representative selection, we propose the first three-stage frameworkโpre-pruning, padded decomposition, and label-aware assignment. We introduce a distance-driven pre-pruning step to improve efficiency; employ randomized padded decomposition to balance intra-group sparsity and global diversity; and design a label-aware cluster assignment mechanism to ensure fair label coverage. We theoretically establish an approximation ratio of $Omega(sqrt{log m}/m)$, significantly improving upon prior methods. Experiments on large-scale datasets demonstrate that our approach achieves state-of-the-art performance in both minimum pairwise distance and label coverage, while strictly satisfying fairness constraints.
๐ Abstract
Diversity maximization problem is a well-studied problem where the goal is to find $k$ diverse items. Fair diversity maximization aims to select a diverse subset of $k$ items from a large dataset, while requiring that each group of items be well represented in the output. More formally, given a set of items with labels, our goal is to find $k$ items that maximize the minimum pairwise distance in the set, while maintaining that each label is represented within some budget. In many cases, one is only interested in selecting a handful (say a constant) number of items from each group. In such scenario we show that a randomized algorithm based on padded decompositions improves the state-of-the-art approximation ratio to $sqrt{log(m)}/(3m)$, where $m$ is the number of labels. The algorithms work in several stages: ($i$) a preprocessing pruning which ensures that points with the same label are far away from each other, ($ii$) a decomposition phase, where points are randomly placed in clusters such that there is a feasible solution with maximum one point per cluster and that any feasible solution will be diverse, $(iii)$ assignment phase, where clusters are assigned to labels, and a representative point with the corresponding label is selected from each cluster. We experimentally verify the effectiveness of our algorithm on large datasets.