🤖 AI Summary
This work addresses the computational challenge of k-means clustering on ultra-large-scale, high-dimensional datasets (10⁷–10⁹ points, d ≥ 100) with extremely large k (> 10⁵), where all current practical methods require Ω(k²) time, with the point-reassignment step of Lloyd's algorithm as the dominant bottleneck. The authors propose a new problem family, "seeded approximate nearest-neighbor (ANN) search," and introduce the "seeded search graph" method, built on graph-based indexing. The approach integrates seed-guided retrieval, dynamic centroid updates, and memory-aware parallelization. Evaluated on billion-point datasets, it achieves a 3.2× speedup over FAISS-kmeans and 17× over scikit-learn while maintaining an Adjusted Rand Index (ARI) above 98%. To the authors' knowledge, this is the first method enabling high-quality, real-time k-means clustering at k > 10⁵.
📝 Abstract
For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7 \sim 10^9$ points in high dimensions ($d \geq 100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we introduce "Seeded Search-Graph" methods as a solution.
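To make the bottleneck concrete, the sketch below (not the paper's implementation) shows one Lloyd iteration in NumPy with exact, brute-force reassignment. The function name `lloyd_iteration` is hypothetical; the point is that the reassignment step scales with $n \cdot k \cdot d$, which is exactly the step the paper replaces with a seeded ANN search over the centroids, warm-started from each point's previous assignment.

```python
import numpy as np

def lloyd_iteration(points, centroids):
    """One Lloyd iteration with exact (brute-force) reassignment.

    The reassignment below costs O(n * k * d) per iteration; at
    k > 10^5 this dominates the runtime, which motivates replacing
    it with approximate nearest-neighbor search over the centroids.
    """
    # Reassignment: squared distance from every point to every centroid.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)

    # Update: move each centroid to the mean of its assigned points.
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[assign == j]
        if len(members):
            new_centroids[j] = members.mean(axis=0)
    return assign, new_centroids
```

In a seeded variant, each point's previous assignment would serve as the entry point ("seed") for a graph-based nearest-neighbor search, so the per-point cost no longer grows linearly in $k$.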