🤖 AI Summary
This work addresses the high computational cost of clustering high-dimensional vectors, which poses a significant bottleneck for large-scale indexing in vector retrieval systems. To overcome this challenge, the authors propose SuperKMeans, a novel approach that integrates dimension pruning with a recall-based early-stopping mechanism. This combination substantially accelerates k-means training while preserving the quality of the cluster centroids. Empirical evaluations demonstrate that SuperKMeans achieves up to a 7× speedup over FAISS and Scikit-Learn on CPU and outperforms cuVS by up to 4× on GPU, all without compromising clustering accuracy. The method thus enables efficient, high-quality clustering of high-dimensional data, making it well-suited for deployment in large-scale vector search infrastructures.
📝 Abstract
We present SuperKMeans, a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans' clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans' acceleration comes from reducing data-access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that terminates k-means early when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at https://github.com/cwida/SuperKMeans.
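To make the pruning idea concrete, here is a minimal sketch of one classic way to skip dimensions during centroid assignment: accumulate the squared distance dimension by dimension and abandon a candidate centroid as soon as the running sum exceeds the best distance found so far. This is an illustrative partial-distance early-exit, not SuperKMeans' actual pruning strategy; all function names here are hypothetical.

```python
import numpy as np

def assign_with_pruning(X, centroids):
    """Assign each vector in X to its nearest centroid.

    Illustrative partial-distance pruning (a sketch, not the paper's
    method): the squared distance to a candidate centroid is summed
    one dimension at a time, and the remaining dimensions are skipped
    once the partial sum already exceeds the best distance so far.
    Since squared terms are non-negative, skipping is always safe.
    """
    n, d = X.shape
    assignments = np.empty(n, dtype=np.int64)
    for i in range(n):
        best_dist = np.inf
        best_j = -1
        for j, c in enumerate(centroids):
            dist = 0.0
            for t in range(d):
                diff = X[i, t] - c[t]
                dist += diff * diff
                if dist >= best_dist:
                    # Remaining dimensions can only increase dist: prune.
                    break
            if dist < best_dist:
                best_dist = dist
                best_j = j
        assignments[i] = best_j
    return assignments
```

Because each skipped dimension avoids both a memory access and a multiply-add, the savings grow with dimensionality; the exact assignments are unchanged, since pruning only discards candidates that provably cannot win.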