A Global Optimization Algorithm for K-Center Clustering of One Billion Samples

📅 2022-12-30
🏛️ arXiv.org
🤖 AI Summary
This work addresses the K-center clustering problem, which selects K samples as cluster centers to minimize the maximum within-cluster distance, on datasets of up to one billion samples. It proposes a reduced-space branch-and-bound algorithm that guarantees convergence to the global optimum in a finite number of steps by branching only on the regions of the centers. The method relies on a two-stage decomposable lower bound with a closed-form solution and is accelerated by bounds tightening, sample reduction, and parallelization. In the reported experiments, it solves K-center problems to global optimality within 4 hours for ten million samples in serial mode and one billion samples in parallel mode. Compared with state-of-the-art heuristic methods, the global optimum reduces the objective value by 25.8% on average across the synthetic and real-world datasets tested.
📝 Abstract
This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. The algorithm is based on a reduced-space branch and bound scheme and guarantees convergence to the global optimum in a finite number of steps by branching only on the regions of centers. To improve efficiency, we have designed a two-stage decomposable lower bound whose solution can be derived in closed form. We also propose several acceleration techniques, including bounds tightening, sample reduction, and parallelization, to narrow down the region of centers. Extensive studies on synthetic and real-world datasets demonstrate that our algorithm can solve K-center problems to global optimality within 4 hours for ten million samples in serial mode and one billion samples in parallel mode. Moreover, compared with state-of-the-art heuristic methods, the global optimum obtained by our algorithm reduces the objective function by 25.8% on average across all the synthetic and real-world datasets.
Problem

Research questions and friction points this paper is trying to address.

Global optimization algorithm for K-center clustering
Minimizes maximum within-cluster distance efficiently
Handles billion-scale datasets with guaranteed convergence
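
To make the objective concrete, the following minimal sketch evaluates the K-center cost for a given choice of centers. It assumes Euclidean distance and a dense NumPy array; the function name `k_center_objective` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def k_center_objective(X, center_idx):
    """Maximum distance from any sample to its nearest chosen center.

    X is an (n, d) array of samples; center_idx lists the K rows of X
    used as centers. Euclidean distance is an assumption of this sketch.
    """
    centers = X[center_idx]  # shape (K, d)
    # Pairwise sample-to-center distances, shape (n, K).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Each sample is served by its nearest center; the cost is the worst case.
    return dists.min(axis=1).max()

# Toy data: two well-separated pairs of points on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(k_center_objective(X, [0, 2]))  # 1.0  (one center per pair)
print(k_center_objective(X, [0, 1]))  # 10.0 (both centers in the left pair)
```

The dense (n, K) distance matrix is only practical for small inputs; it is meant to illustrate the definition, not the scale the paper targets.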
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduced-space branch and bound scheme
Two-stage decomposable closed-form lower bound
Bounds tightening and sample reduction acceleration
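
As an illustration of how branching on center regions can yield a closed-form bound, the sketch below computes a simple lower bound when each center is restricted to an axis-aligned box. This is a generic relaxation written for this summary, not the paper's two-stage bound; the name `box_lower_bound`, the box representation (`lo`, `hi`), and the Euclidean metric are all assumptions.

```python
import numpy as np

def box_lower_bound(X, lo, hi):
    """Lower bound on the K-center cost when center k lies in [lo[k], hi[k]].

    X is (n, d); lo and hi are (K, d) arrays of box corners (hypothetical
    representation). For every sample, the distance to any center inside a
    box is at least the distance to the box itself, so taking the nearest
    box per sample and then the worst sample can never exceed the true cost.
    """
    # Closest point of each box to each sample, shape (n, K, d).
    nearest_pt = np.clip(X[:, None, :], lo[None, :, :], hi[None, :, :])
    # Sample-to-box distances, shape (n, K).
    d = np.linalg.norm(X[:, None, :] - nearest_pt, axis=2)
    return d.min(axis=1).max()

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])

# Node A: both centers confined to the left pair of points.
lo_a = np.array([[0.0, 0.0], [0.0, 0.0]])
hi_a = np.array([[1.0, 0.0], [1.0, 0.0]])
print(box_lower_bound(X, lo_a, hi_a))  # 10.0

# Node B: one box around each pair of points.
lo_b = np.array([[0.0, 0.0], [10.0, 0.0]])
hi_b = np.array([[1.0, 0.0], [11.0, 0.0]])
print(box_lower_bound(X, lo_b, hi_b))  # 0.0
```

In a branch-and-bound search, a node whose lower bound exceeds the best known feasible cost can be discarded. If a feasible solution with cost 1.0 is already known for this toy data, node A (bound 10.0) would be pruned, while node B would be kept and subdivided further.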