🤖 AI Summary
This work addresses the computational challenge of k-means clustering on ultra-large-scale, high-dimensional datasets (10⁷–10⁹ points, d ≥ 100) with extremely large k (> 10⁵), where all current practical methods require Ω(k²) time, with the point-reassignment step of Lloyd's algorithm as the dominant bottleneck. The authors propose a new problem family, "seeded approximate nearest-neighbor (ANN) search," and introduce the "seeded search graph" method, built on graph-based indexing. The approach integrates seed-guided retrieval, dynamic centroid updates, and memory-aware parallelization. Evaluated on billion-point datasets, it achieves a 3.2× speedup over FAISS-kmeans and 17× over scikit-learn while maintaining an Adjusted Rand Index (ARI) above 98%. To the authors' knowledge, this is the first method enabling high-quality, real-time k-means clustering at k > 10⁵.
📝 Abstract
For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7 \sim 10^9$ points in high dimensions ($d \geq 100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we introduce "Seeded Search-Graph" methods as a solution.
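To make the bottleneck concrete, the sketch below (not the paper's implementation) shows one Lloyd iteration in NumPy with exact, brute-force reassignment. The function name `lloyd_iteration` is hypothetical; the point is that the reassignment step scales with $n \cdot k \cdot d$, which is exactly the step the paper replaces with a seeded ANN search over the centroids, warm-started from each point's previous assignment.

```python
import numpy as np

def lloyd_iteration(points, centroids):
    """One Lloyd iteration with exact (brute-force) reassignment.

    The reassignment below costs O(n * k * d) per iteration; at
    k > 10^5 this dominates the runtime, which motivates replacing
    it with approximate nearest-neighbor search over the centroids.
    """
    # Reassignment: squared distance from every point to every centroid.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)

    # Update: move each centroid to the mean of its assigned points.
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[assign == j]
        if len(members):
            new_centroids[j] = members.mean(axis=0)
    return assign, new_centroids
```

In a seeded variant, each point's previous assignment would serve as the entry point ("seed") for a graph-based nearest-neighbor search, so the per-point cost no longer grows linearly in $k$.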