🤖 AI Summary
This work addresses the high I/O overhead and latency that existing systems incur when performing attribute-constrained (filtered) vector search on SSDs, which stem primarily from frequent reads of attribute data. The authors propose a filtering mechanism that uses probabilistic data structures such as Bloom filters to identify a superset of the valid vectors, sharply reducing the number of attribute reads at the cost of exploring a small number of false positives. Approximate nearest neighbor search runs over this superset, and attributes are verified afterward on the top-k results to ensure correctness. According to the authors' evaluations, the approach improves search latency and throughput compared to state-of-the-art systems. The implementation is open-source.
📝 Abstract
We propose PipeANN-Filter, an efficient filtered vector search system on SSD. Unlike existing systems that explore only valid vectors (i.e., those satisfying the attribute constraints) during search, PipeANN-Filter explores a superset of valid vectors, and performs attribute verification after getting the top-k closest result vectors. This allows PipeANN-Filter to leverage probabilistic data structures (e.g., Bloom filters) to identify the superset, trading off a small number of false-positive vector explorations for a massive reduction in SSD I/O for attribute reading. Evaluations show that PipeANN-Filter improves search latency and throughput compared to state-of-the-art systems. PipeANN-Filter is open-source at https://github.com/thustorage/PipeANN
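The superset-then-verify idea described in the abstract can be illustrated with a small, self-contained sketch. This is not PipeANN-Filter's code or API: the `BloomFilter` class, the `filtered_search` function, the one-filter-per-label layout, and the toy dataset are all hypothetical names and choices made for illustration. A brute-force distance sort stands in for the on-SSD graph search, and a Python list stands in for attribute data stored on SSD. The sketch also keeps walking the ranked list until k verified results are found, which is a choice made here and not something the abstract specifies.

```python
import hashlib
import math
import random


class BloomFilter:
    """Minimal in-memory Bloom filter (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{seed}:{item}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filtered_search(query, label, k, vectors, attributes_on_disk, bloom_by_label):
    """Return (top-k valid vector ids, number of simulated attribute reads)."""
    bloom = bloom_by_label[label]

    # Step 1: build a superset of valid vectors using only the in-memory
    # filter. No attribute data is read here.
    candidates = [vid for vid in range(len(vectors)) if bloom.might_contain(vid)]

    # Step 2: rank the superset by distance. Brute force is an assumption
    # standing in for an approximate graph-based search.
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))

    # Step 3: verify attributes only for the closest candidates, dropping
    # any false positives the Bloom filter let through.
    results, attribute_reads = [], 0
    for vid in candidates:
        attribute_reads += 1  # simulated SSD read of this vector's attributes
        if label in attributes_on_disk[vid]:
            results.append(vid)
            if len(results) == k:
                break
    return results, attribute_reads


# Toy dataset: 200 random 4-d vectors, each tagged with one label.
rng = random.Random(0)
labels = ["red", "green", "blue"]
vectors = [[rng.random() for _ in range(4)] for _ in range(200)]
attributes_on_disk = [{rng.choice(labels)} for _ in vectors]

# One Bloom filter per label, populated with the ids of matching vectors.
bloom_by_label = {name: BloomFilter(num_bits=1024, num_hashes=3) for name in labels}
for vid, attrs in enumerate(attributes_on_disk):
    for name in attrs:
        bloom_by_label[name].add(vid)

query = [0.5, 0.5, 0.5, 0.5]
results, reads = filtered_search(
    query, "red", 5, vectors, attributes_on_disk, bloom_by_label
)
print("verified top-5 ids:", results)
print("attribute reads:", reads, "of", len(vectors), "vectors")
```

Because a Bloom filter never produces false negatives, every valid vector is in the candidate superset, so verification on the ranked list yields the same answer as filtering first. The saving in this toy setup is that attributes are read only for the few closest candidates, not for every vector visited.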