🤖 AI Summary
This work addresses the slow construction, high memory overhead, and limited scalability of existing graph-based approximate nearest neighbor (ANN) indexing methods on billion-scale datasets. The authors propose PiPNN, a partition-based parallel graph construction algorithm. Its key innovation is HashPrune, an online pruning mechanism that dynamically maintains a sparse edge set so that a high-quality index can be built under bounded memory. By combining data partitioning into overlapping sub-problems, batched distance computation via dense matrix multiplication, and streaming edge insertion, PiPNN builds a billion-scale ANN index on a single multicore machine in under 20 minutes. It builds indexes up to 11.6× faster than Vamana and up to 12.9× faster than HNSW, and at least 19.1× faster than MIRAGE and 17.3× faster than FastKCNA, while producing indexes that achieve higher query throughput.
📝 Abstract
The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have long construction times because they rely on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids the "search bottleneck" from which existing graph-based methods suffer.
PiPNN's core innovation is HashPrune, a novel online pruning algorithm that dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction, which permits PiPNN to build higher-quality indexes without extra intermediate memory.
PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is also significantly more scalable than recent algorithms for fast graph construction: it builds indexes at least 19.1x faster than MIRAGE and 17.3x faster than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine.
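To make the "bulk distance comparisons via dense matrix multiplication" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and it does not attempt to reproduce HashPrune, whose details the abstract does not give. It only illustrates the standard identity ‖x − y‖² = ‖x‖² + ‖y‖² − 2x·y, which turns all pairwise distances inside one partition into a single matrix product. The function names `pairwise_sq_dists` and `knn_in_partition` and the parameter `k` are hypothetical, chosen for illustration.

```python
import numpy as np


def pairwise_sq_dists(X: np.ndarray) -> np.ndarray:
    """Squared Euclidean distances between all rows of X using one matrix product."""
    sq_norms = np.einsum("ij,ij->i", X, X)      # ||x_i||^2 for every row
    gram = X @ X.T                              # dense matmul, the bulk of the work
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.maximum(d2, 0.0, out=d2)                 # clamp tiny negatives caused by rounding
    np.fill_diagonal(d2, 0.0)                   # a point is at distance 0 from itself
    return d2


def knn_in_partition(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest other points for each row of X (assumes k < len(X)).

    Hypothetical helper: stands in for picking candidate edges inside one
    partition; the real algorithm's selection and pruning rules are not shown.
    """
    d2 = pairwise_sq_dists(X)
    np.fill_diagonal(d2, np.inf)                # exclude self-matches
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]          # k smallest, unordered
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)            # sorted by distance


rng = np.random.default_rng(0)
points = rng.standard_normal((50, 8))           # one small "partition" of 50 vectors
neighbors = knn_in_partition(points, k=5)       # shape (50, 5)
```

The appeal of this formulation is that the cost is dominated by one `X @ X.T` call with a regular memory access pattern, in contrast to the random-access-heavy beam searches the abstract identifies as the construction bottleneck.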