🤖 AI Summary
Filtering Approximate Nearest Neighbor (ANN) search, i.e., vector similarity search under structured attribute constraints, lacks a systematic analytical framework and a standardized evaluation methodology.
Method: We propose the first comprehensive analysis framework for Filtering ANN, featuring (1) a unified interface and a novel taxonomy grounded in attribute types and filtering strategies; (2) a reproducible experimental platform that decouples the impacts of index structures, pruning techniques, and entry-point selection on performance; and (3) rigorous evaluation across four real-world and synthetic datasets (up to 10M entries), covering ten algorithms and twelve methods under query selectivity ranging from 0.1% to 100%.
Contribution/Results: Our analysis reveals empirically validated combinatorial principles for effective multi-dimensional pruning, edge filtering, and entry-point optimization. We provide actionable, application-oriented guidelines for algorithm selection. All code is open-sourced to advance standardization and reproducibility in Filtering ANN research.
📝 Abstract
With the growing integration of structured and unstructured data, new methods have emerged for performing similarity searches on vectors while honoring structured attribute constraints, a process known as Filtering Approximate Nearest Neighbor (Filtering ANN) search. Since many of these algorithms have appeared only in recent years and are designed to work with a variety of base indexing methods and filtering strategies, there is a pressing need for a unified analysis that identifies their core techniques and enables meaningful comparisons.
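To make the problem concrete: a Filtering ANN query asks for the k nearest vectors to a query point, restricted to items whose structured attributes satisfy a predicate. The sketch below is a hypothetical brute-force "pre-filtering" baseline for illustration only (it is not one of the benchmarked algorithms): the attribute predicate is applied first, and the surviving items are ranked by Euclidean distance. The function and attribute names are invented for this example.

```python
import numpy as np

def filtered_knn(vectors, attrs, query, predicate, k=5):
    """Brute-force Filtering ANN baseline (pre-filtering):
    keep only items whose attribute passes `predicate`,
    then return the ids of the k nearest survivors."""
    mask = np.array([predicate(a) for a in attrs])
    candidates = np.where(mask)[0]          # ids that satisfy the filter
    if candidates.size == 0:
        return np.array([], dtype=int)
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]           # k smallest distances
    return candidates[order]

# Toy data: 1,000 16-d vectors, each tagged with a "year" attribute.
rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 16))
years = rng.integers(2000, 2025, size=1000)
q = rng.standard_normal(16)

ids = filtered_knn(vecs, years, q, lambda y: y >= 2020, k=10)
assert all(years[i] >= 2020 for i in ids)  # every result honors the constraint
```

The fraction of items passing the predicate is the query's *selectivity*; the specialized index structures surveyed here exist because this exact brute-force scan becomes too slow at scale, especially at low selectivity.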
In this work, we present a unified Filtering ANN search interface that encompasses the latest algorithms and evaluate them extensively from multiple perspectives. First, we propose a comprehensive taxonomy of existing Filtering ANN algorithms based on attribute types and filtering strategies. Next, we analyze their key components, namely index structures, pruning strategies, and entry-point selection, to elucidate design differences and tradeoffs. We then conduct a broad experimental evaluation of 10 algorithms and 12 methods across 4 datasets (each with up to 10 million items), incorporating both synthetic and real attributes and covering selectivity levels from 0.1% to 100%. Finally, an in-depth component analysis reveals the influence of pruning, entry-point selection, and edge-filtering costs on overall performance. Based on our findings, we summarize the strengths and limitations of each approach, provide practical guidelines for selecting appropriate methods, and suggest promising directions for future research. Our code is available at: https://github.com/lmccccc/FANNBench.