A column generation algorithm with dynamic constraint aggregation for minimum sum-of-squares clustering

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the exact solution of Minimum Sum-of-Squares Clustering (MSSC), also known as k-means clustering. On large-scale instances, conventional column generation (CG) is inefficient because the master problem has a very large number of constraints and is severely degenerate. To overcome this bottleneck, the paper introduces Dynamic Constraint Aggregation (DCA) into the CG framework for MSSC, which the summary describes as the first application of DCA to this problem. By partitioning the set partitioning constraints into disjoint clusters, DCA substantially reduces the size of the master problem and mitigates degeneracy. Leveraging the geometric structure of Euclidean distances, the authors formulate an integer programming model and conduct systematic ablation studies on the DCA design choices. The algorithm significantly outperforms state-of-the-art exact methods on multiple large-scale benchmarks: speedups reach an order of magnitude, scalability is markedly improved, and exact k-means clustering becomes practical on datasets with tens of thousands of samples.

📝 Abstract
The minimum sum-of-squares clustering problem (MSSC), also known as $k$-means clustering, is the problem of partitioning $n$ data points into $k$ clusters so as to minimize the total sum of squared Euclidean distances between each point and the center of its assigned cluster. We propose an efficient algorithm for solving large-scale MSSC instances, which combines column generation (CG) with dynamic constraint aggregation (DCA) to effectively reduce the number of constraints considered in the CG master problem. DCA was originally conceived to reduce degeneracy in set partitioning problems by utilizing an aggregated restricted master problem obtained from a partition of the set partitioning constraints into disjoint clusters. In this work, we explore the use of DCA within a CG algorithm for the exact solution of MSSC. Our method is fine-tuned through a series of ablation studies on DCA design choices, and is shown to significantly outperform existing state-of-the-art exact approaches from the literature.
Problem

Research questions and friction points this paper is trying to address.

Solving large-scale MSSC (k-means) clustering efficiently
Reducing constraints in column generation via dynamic aggregation
Improving exact solution performance for MSSC problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Column generation with dynamic constraint aggregation
Reduces constraints in CG master problem
Optimized via DCA design ablation studies
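
The aggregation idea can be sketched as follows, assuming the usual DCA setup for set partitioning: the constraints (here, one per data point) are grouped into disjoint clusters, and a column can be represented in the aggregated master problem only if it covers each group entirely or not at all. The helper names `is_compatible` and `aggregate_column` are hypothetical, and this is not the authors' implementation.

```python
def is_compatible(column, partition):
    """A column (set of covered point indices) is compatible with the
    partition if every group is either fully inside it or disjoint from it."""
    return all(group <= column or group.isdisjoint(column) for group in partition)

def aggregate_column(column, partition):
    """Return one 0/1 coefficient per group, i.e. the column as it would
    appear in an aggregated master problem with one constraint per group."""
    if not is_compatible(column, partition):
        raise ValueError("column is incompatible with the current partition")
    return [1 if group <= column else 0 for group in partition]

# Five points whose constraints are aggregated into three groups (toy data).
partition = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4})]

print(is_compatible(frozenset({0, 1, 2}), partition))     # True
print(aggregate_column(frozenset({0, 1, 2}), partition))  # [1, 1, 0]
print(is_compatible(frozenset({1, 2}), partition))        # False: splits group {0, 1}
```

With this partition the aggregated problem carries three covering constraints instead of five. An incompatible column signals that the partition may need to be refined, which is what makes the aggregation dynamic.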
A. M. Sudoso
Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza University of Rome, Via Ariosto 25, Rome, 00185, Italy.
Daniel Aloise
Department of Computer and Software Engineering, Polytechnique Montréal, 2500 Chem. de Polytechnique, Montréal, H3T 1J4, Canada.