Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the computational challenge of k-means clustering on ultra-large-scale, high-dimensional datasets (10⁷–10⁹ points, d ≥ 100) with extremely large k (>10⁵), where the point-reassignment step in Lloyd’s algorithm incurs an Ω(k²) time bottleneck. We propose a novel paradigm—“seeded approximate nearest neighbor (ANN) search”—and introduce the “seeded search graph” method, built upon graph-based indexing. Our approach integrates seed-guided retrieval, dynamic centroid updates, and memory-aware parallelization. Evaluated on billion-point datasets, it achieves 3.2× speedup over FAISS-kmeans and 17× over Scikit-learn, while maintaining an Adjusted Rand Index (ARI) exceeding 98%. To our knowledge, this is the first method enabling high-quality, real-time k-means clustering at k > 10⁵ scale.

📝 Abstract
For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
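To make the bottleneck concrete, here is a minimal pure-Python sketch of Lloyd's algorithm with a brute-force reassignment step. It is a textbook baseline, not the authors' code, and the function names (`sq_dist`, `reassign`, `update`, `lloyd`) are chosen only for illustration. The reassignment step evaluates every point against every center, so it performs n·k distance computations per iteration. Since k non-empty clusters need at least k points, that is at least k² evaluations, which is one way to see where a quadratic dependence on k can come from.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reassign(points, centers):
    """Brute-force reassignment: n * k distance evaluations per call."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of its points; empty clusters stay put."""
    d = len(centers[0])
    sums = [[0.0] * d for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]


def lloyd(points, centers, iters=10):
    """Alternate update and reassignment until labels stop changing."""
    labels = reassign(points, centers)
    for _ in range(iters):
        centers = update(points, labels, centers)
        new_labels = reassign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


if __name__ == "__main__":
    pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(lloyd(pts, [[0.0, 0.0], [10.0, 10.0]]))
```

Replacing `reassign` with something cheaper than a full scan over the centers is the part of the loop the abstract identifies as critical.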
Problem

Research questions and friction points this paper is trying to address.

Scalable k-Means for large k
Speed up Lloyd's algorithm
Seeded Approximate Nearest-Neighbor Search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable k-Means Clustering
Seeded Approximate Nearest-Neighbor
Seeded Search-Graph methods
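This summary does not say how a seeded search-graph works internally, so the following is only a sketch under stated assumptions. It assumes that each query point comes with a "seed" center (for example, the center it was assigned to in the previous Lloyd iteration) and that a greedy walk over a neighbor graph on the centers starts from that seed. The helper `build_knn_graph` and the function `seeded_greedy_search` are hypothetical names, and the brute-force graph construction is for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_knn_graph(centers, m):
    """Hypothetical helper: link each center to its m nearest other centers.

    Built by brute force here purely for illustration.
    """
    graph = []
    for i, c in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: sq_dist(c, centers[j]),
        )
        graph.append(others[:m])
    return graph


def seeded_greedy_search(query, seed, centers, graph):
    """Greedy walk from the seed center toward the query.

    Assumption: the seed is a hint that is often already close to the
    query, so the walk usually stops after a few steps. The result is an
    approximate nearest center, not a guaranteed exact one.
    """
    current = seed
    best = sq_dist(query, centers[current])
    while graph[current]:
        # Pick the neighbor of the current center closest to the query.
        nb = min(graph[current], key=lambda j: sq_dist(query, centers[j]))
        d = sq_dist(query, centers[nb])
        if d >= best:
            break  # no neighbor improves on the current center
        current, best = nb, d
    return current


if __name__ == "__main__":
    centers = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    graph = build_knn_graph(centers, m=2)
    print(seeded_greedy_search([3.9], seed=0, centers=centers, graph=graph))
```

With a good seed the walk only touches a handful of centers per point instead of all k. With a poor seed it either takes longer or stops in a local minimum, which is why the result is only approximate.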
Jack Spalding-Jamieson
Independent
Eliot Wong Robson
Department of Computer Science, University of Illinois
Da Wei Zheng
University of Illinois Urbana-Champaign
theoretical computer science, computational geometry, dynamic data structures, graph algorithms