🤖 AI Summary
This paper studies the fair $k$-center clustering problem over data streams, where the input comprises $m$ sensitive groups, each subject to an upper bound on the number of selected centers to ensure representativeness. We propose the first single-pass streaming algorithm for this problem. Its core innovations are the construction of a $\lambda$-independent center set, introduced here for the first time in streaming settings, and the formulation of center selection as a constrained vertex cover problem. Our algorithm achieves a tight 5-approximation ratio with memory complexity $O(k \log n)$. An offline variant attains a 3-approximation, matching the current state of the art. We further extend the framework to semi-structured streams in which the points of each group arrive in batches, and design efficient batch-processing strategies for that setting. Experiments demonstrate that our method significantly outperforms existing baselines in both clustering quality and runtime efficiency, offering strong theoretical guarantees and practical scalability.
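To make the problem statement concrete, here is a minimal Python sketch of the two quantities a fair $k$-center solution is judged on: the clustering cost (the largest distance from any point to its nearest center) and whether the chosen centers respect the per-group upper bounds. The function names, the toy data, and the choice of Euclidean distance are assumptions made for illustration; they do not come from the paper.

```python
import math
from collections import Counter

def clustering_cost(points, centers):
    """k-center objective: largest distance from a point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)

def is_fair(center_groups, caps):
    """True if no group supplies more centers than its upper bound.

    Assumption for this sketch: a group with no entry in `caps` has bound 0.
    """
    counts = Counter(center_groups)
    return all(counts[g] <= caps.get(g, 0) for g in counts)

# Hypothetical toy instance: four points on a line, two groups, one center each.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
groups = ["a", "b", "a", "b"]
caps = {"a": 1, "b": 1}

chosen = [1, 2]  # indices of the selected centers
centers = [points[i] for i in chosen]

cost = clustering_cost(points, centers)
fair = is_fair([groups[i] for i in chosen], caps)
print(cost, fair)
```

Picking both centers from group `a` would violate the bound `caps["a"] = 1`, which is the kind of solution the fairness constraint rules out.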
📝 Abstract
Many real-world applications require incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to ensure fairness. Focusing on big data scenarios, this paper addresses the problem in a streaming setting, where data points arrive one at a time in a continuous stream. Leveraging a structure called the $\lambda$-independent center set, we propose a one-pass streaming algorithm that first computes a reserved set of points during the streaming process. Then, for the post-streaming process, we propose an approach for selecting centers from the reserved point set by analyzing all three possible cases, transforming the most complicated one into a specially constrained vertex cover problem in an auxiliary graph. Our algorithm achieves a tight approximation ratio of 5 while consuming $O(k \log n)$ memory. It can also be readily adapted to solve the offline fair $k$-center problem, achieving a 3-approximation ratio that matches the current state of the art. Furthermore, we extend our approach to a semi-structured data stream, where data points from each group arrive in batches. In this setting, we present a 3-approximation algorithm for $m = 2$ and a 4-approximation algorithm for general $m$. Lastly, we conduct extensive experiments to evaluate the performance of our approaches, demonstrating that they outperform existing baselines in both clustering cost and runtime efficiency.
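The abstract does not define the $\lambda$-independent center set, so the following Python sketch only illustrates one plausible reading: a set of points that are pairwise more than $\lambda$ apart, maintained in a single pass for a fixed guess of $\lambda$. This is an assumption, not the authors' algorithm. It ignores the group bounds, the search over $\lambda$, and the post-streaming center selection, and the name `stream_separated_set` is hypothetical.

```python
import math

def stream_separated_set(stream, lam):
    """Single pass over the stream.

    Keep a point only if it is farther than `lam` from every point kept so far,
    so the kept points are pairwise more than `lam` apart and every discarded
    point lies within `lam` of some kept point.
    """
    kept = []
    for p in stream:
        if all(math.dist(p, q) > lam for q in kept):
            kept.append(p)
    return kept

# Hypothetical toy stream of 2-D points.
stream = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.1), (9.0, 0.0)]
reserved = stream_separated_set(stream, lam=1.0)
print(reserved)
```

Under this reading, a standard packing argument applies: if more than $k$ points are kept, two of them must share a center in any $k$-center solution, so the optimal unconstrained radius exceeds $\lambda / 2$ and the guess for $\lambda$ was too small.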