A Scalable Global Optimization Algorithm For Constrained Clustering

📅 2025-10-26
🤖 AI Summary
Constrained clustering with pairwise must-link and cannot-link constraints poses significant scalability challenges for existing mixed-integer programming (MIP) approaches, which struggle to balance global optimality and computational tractability on large datasets. This paper proposes Sample-Driven Constrained Group-Based Branch-and-Bound (SDC-GBB), a scalable branch-and-bound framework that combines grouped-sample Lagrangian decomposition with geometric elimination rules. To improve efficiency further, it collapses must-linked samples into centroid-based pseudo-samples and parallelizes the computation. Experiments show that the method handles datasets of up to 200,000 samples with cannot-link constraints and 1,500,000 samples with must-link constraints, which is 200 to 1,500 times larger than the current state of the art under comparable constraint settings, while keeping the optimality gap below 3%. The result substantially eases the scalability bottleneck in globally optimal constrained clustering.

📝 Abstract
Constrained clustering leverages limited domain knowledge to improve clustering performance and interpretability, but incorporating pairwise must-link and cannot-link constraints is NP-hard, making global optimization intractable. Existing mixed-integer optimization methods are confined to small-scale datasets, limiting their utility. We propose Sample-Driven Constrained Group-Based Branch-and-Bound (SDC-GBB), a decomposable branch-and-bound (BB) framework that collapses must-linked samples into centroid-based pseudo-samples and prunes cannot-link constraints through geometric rules, while preserving convergence and guaranteeing global optimality. By integrating grouped-sample Lagrangian decomposition and geometric elimination rules for efficient lower and upper bounds, the algorithm scales pairwise-constrained k-means clustering through parallelism. Experimental results show that our approach handles datasets with 200,000 samples with cannot-link constraints and 1,500,000 samples with must-link constraints, sizes that are 200 to 1,500 times larger than the current state of the art under comparable constraint settings, while reaching an optimality gap of less than 3%. By providing deterministic global guarantees, our method also avoids the search failures that off-the-shelf heuristics often encounter on large datasets.
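The must-link collapse described in the abstract can be pictured with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function name `collapse_must_links`, the union-find grouping, and the choice to return a weight (group size) alongside each centroid are all hypothetical choices made here.

```python
import numpy as np

def collapse_must_links(X, must_links):
    """Merge must-linked samples into weighted centroid pseudo-samples.

    Hypothetical helper (not from the paper). Returns (centroids, weights,
    labels), where labels[i] is the pseudo-sample index of original sample i.
    """
    n = X.shape[0]
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Must-link is transitive, so connected components form the groups.
    for i, j in must_links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)

    weights = np.bincount(labels).astype(float)      # group sizes
    centroids = np.zeros((weights.size, X.shape[1]))
    np.add.at(centroids, labels, X)                  # per-group coordinate sums
    centroids /= weights[:, None]                    # sums -> means
    return centroids, weights, labels

# Usage: samples 0, 1 and 3 are chained by must-links; sample 2 stays alone.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [1.0, 3.0]])
centroids, weights, labels = collapse_must_links(X, [(0, 1), (1, 3)])
```

Keeping the group size as a weight matters: a pseudo-sample that stands in for many points should pull a cluster center proportionally harder than a single unconstrained point.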

Problem

Research questions and friction points this paper is trying to address.

Solving NP-hard constrained clustering with global optimization
Scaling constrained clustering to large datasets efficiently
Ensuring deterministic global optimality with minimal optimality gap
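
For orientation, pairwise-constrained k-means is commonly written as the mixed-integer program below. This is the standard textbook form and may differ from the exact formulation used in the paper; the symbols $x_i$ (samples), $\mu_k$ (cluster centers), $b_{ik}$ (binary assignments), $\mathcal{ML}$ (must-link pairs), and $\mathcal{CL}$ (cannot-link pairs) are notation assumed here.

```latex
\begin{aligned}
\min_{\mu,\, b}\quad & \sum_{i=1}^{n} \sum_{k=1}^{K} b_{ik}\, \lVert x_i - \mu_k \rVert_2^2 \\
\text{s.t.}\quad & \sum_{k=1}^{K} b_{ik} = 1, && i = 1,\dots,n, \\
& b_{ik} = b_{jk}, && (i,j) \in \mathcal{ML},\ k = 1,\dots,K, \\
& b_{ik} + b_{jk} \le 1, && (i,j) \in \mathcal{CL},\ k = 1,\dots,K, \\
& b_{ik} \in \{0,1\}, \quad \mu_k \in \mathbb{R}^d .
\end{aligned}
```

The optimality gap quoted in the abstract compares the best feasible objective value found (an upper bound) with the best proven lower bound on this minimum.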

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposable branch-and-bound framework with centroid-based pseudo-samples
Geometric elimination rules for pruning cannot-link constraints
Parallel Lagrangian decomposition enabling scalable constrained clustering
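
A standard identity suggests why replacing a must-link group with its centroid is compatible with global optimality. The notation is assumed here, and the paper's own argument may be phrased differently. Let $G$ be a must-link group with centroid $c_G = \frac{1}{|G|}\sum_{i \in G} x_i$, and let $\mu$ be the center of the cluster that the whole group is assigned to:

```latex
\sum_{i \in G} \lVert x_i - \mu \rVert_2^2
  \;=\; |G|\, \lVert c_G - \mu \rVert_2^2
  \;+\; \sum_{i \in G} \lVert x_i - c_G \rVert_2^2 .
```

The second term does not depend on $\mu$ or on the assignment, so minimizing over the original samples is equivalent to minimizing over pseudo-samples $c_G$ weighted by $|G|$, up to an additive constant.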
Pedro Chumpitaz-Flores
Department of Industrial and Systems Engineering, University of South Florida
My Duong
Bellini College of AI, Cybersecurity, and Computing, University of South Florida
Cristobal Heredia
Department of Industrial and Systems Engineering, University of South Florida
Kaixun Hua
Assistant Professor, University of South Florida
Trustworthy AI, Clustering, Global Optimization