AI Summary
The field of Filtered Approximate Nearest Neighbor Search (FANNS) lacks a systematic survey addressing vector-scalar hybrid data, and it suffers from inconsistent problem formulations, the absence of a unified algorithm taxonomy, and insufficient analysis of query difficulty.
Method: We formally define hybrid datasets and hybrid queries, propose a fine-grained algorithm taxonomy centered on pruning mechanisms, and develop a distribution-sensitive query difficulty model. We further design a standardized evaluation framework and an open-source toolchain (Python/PyTorch) supporting hybrid dataset construction, quantitative difficulty assessment, and fair algorithm comparison.
Contribution/Results: This work delivers the first structured, comprehensive survey of FANNS for hybrid data, filling a critical research gap. It establishes foundational theoretical principles and practical tools, enabling rigorous analysis and reproducible advancement in hybrid-data nearest neighbor search.
Abstract
Filtered approximate nearest neighbor search (FANNS), an extension of approximate nearest neighbor search (ANNS) that incorporates scalar filters, has been widely applied to constrained retrieval of vector data. Despite its growing importance, no dedicated survey of FANNS over vector-scalar hybrid data currently exists, and the field has several problems, including inconsistent definitions of the search problem, an insufficient framework for algorithm classification, and incomplete analysis of query difficulty. This survey formally defines the concepts of hybrid dataset and hybrid query, as well as the corresponding evaluation metrics. Building on these definitions, it proposes a pruning-focused framework to classify and summarize existing algorithms, which is broader and finer-grained than existing classification frameworks. In addition, it reviews representative hybrid datasets and analyzes the difficulty of hybrid queries from the perspective of the distribution relationships between data and queries. This paper aims to establish a structured foundation for FANNS over vector-scalar hybrid data, facilitate more meaningful comparisons between FANNS algorithms, and offer practical recommendations for practitioners. The code used for downloading hybrid datasets and analyzing query difficulty is available at https://github.com/lyj-fdu/FANNS
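
To make the notion of a hybrid query concrete, the following is a minimal sketch of exact filtered nearest neighbor search by brute force, the baseline that approximate FANNS methods are usually compared against. It is not taken from the survey or its repository. The function names (`filtered_knn`, `selectivity`, `recall_at_k`), the equality filter on a single scalar label, the Euclidean distance, and the choice of recall as the metric are all assumptions made for illustration.

```python
import math


def filtered_knn(vectors, labels, query, allowed_label, k):
    """Exact filtered k-NN: keep only items whose scalar label passes the
    filter, then rank the survivors by Euclidean distance to the query.
    (Hypothetical helper; an equality filter is assumed for simplicity.)"""
    candidates = [i for i, lab in enumerate(labels) if lab == allowed_label]
    candidates.sort(key=lambda i: math.dist(vectors[i], query))
    return candidates[:k]


def selectivity(labels, allowed_label):
    """Fraction of the dataset that passes the scalar filter."""
    return sum(lab == allowed_label for lab in labels) / len(labels)


def recall_at_k(returned_ids, exact_ids):
    """Share of the exact filtered neighbors that a method returned."""
    if not exact_ids:
        return 1.0
    return len(set(returned_ids) & set(exact_ids)) / len(exact_ids)


# Toy hybrid dataset: each item has a 2-D vector and one scalar label.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (0.5, 0.5)]
labels = [0, 1, 0, 1, 1]
query = (0.0, 0.0)

exact = filtered_knn(vectors, labels, query, allowed_label=1, k=2)
print(exact)                               # ids of the 2 nearest items with label 1
print(selectivity(labels, 1))              # fraction of items passing the filter
print(recall_at_k([4, 3], exact))          # recall of a hypothetical approximate result
```

In this toy setting, item 0 is the closest vector to the query but is excluded by the filter, which is the situation that separates filtered search from plain ANNS. The fraction of items passing the filter is one simple quantity that can be related to query difficulty; the survey's own difficulty analysis is based on distribution relationships between data and queries and is not reproduced here.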