🤖 AI Summary
This work addresses a limitation of existing vector databases: they support filtering only on categorical or numerical attributes and cannot efficiently handle joint queries that combine sequence pattern constraints, such as substring matching, with approximate nearest neighbor (ANN) search. To bridge this gap, we propose VectorMaton, a novel index structure based on augmented suffix automata that, for the first time, deeply integrates sequence pattern predicates (e.g., LIKE/CONTAINS) with ANN search. VectorMaton achieves significant query efficiency gains while keeping the index size close to that of the raw data. Experiments on real-world datasets show that, at the same accuracy, VectorMaton improves query throughput by up to 10× over baseline methods and reduces index size by up to 18×.
📝 Abstract
Approximate nearest neighbor search (ANNS) has become a cornerstone of modern vector database systems. Given a query vector, ANNS retrieves the closest vectors from a set of base vectors. In real-world applications, vectors are often accompanied by additional information, such as sequences or structured attributes, motivating the need for fine-grained vector search with constraints on this auxiliary data. Existing methods support attribute-based filtering or range-based filtering on categorical and numerical attributes, but they do not support pattern predicates over sequence attributes. In relational databases, predicates such as LIKE and CONTAINS are fundamental operators for filtering records based on substring patterns. As vector databases increasingly adopt SQL-style query interfaces, enabling pattern predicates over sequence attributes (e.g., text and biological sequences) alongside vector similarity search becomes essential. In this paper, we formulate a novel problem: given a set of vectors each associated with a sequence, retrieve the nearest vectors whose sequences contain a given query pattern. To address this challenge, we propose VectorMaton, an automaton-based index that integrates pattern filtering with efficient vector search, while maintaining an index size comparable to the dataset size. Extensive experiments on real-world datasets demonstrate that VectorMaton consistently outperforms all baselines, achieving up to 10× higher query throughput at the same accuracy and an up to 18× reduction in index size.
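To make the problem statement concrete, the query semantics can be sketched as a brute-force reference: filter by substring containment first, then scan the survivors exactly. This is not VectorMaton and says nothing about how its automaton-based index works; it only illustrates what a correct answer to a pattern-constrained nearest neighbor query looks like. The function name `pattern_constrained_knn`, its parameters, and the choice of Euclidean distance are assumptions made for illustration, since the abstract does not fix a distance metric or an API.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact L2 distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pattern_constrained_knn(
    vectors: List[Sequence[float]],
    sequences: List[str],
    query_vector: Sequence[float],
    pattern: str,
    k: int,
) -> List[Tuple[int, float]]:
    """Hypothetical brute-force reference for the query semantics.

    Returns up to k (index, distance) pairs, nearest first, restricted to
    items whose associated sequence contains `pattern` as a substring.
    """
    # Pre-filter: keep only items that satisfy the CONTAINS predicate.
    candidates = (i for i, s in enumerate(sequences) if pattern in s)
    # Exact distance scan over the surviving items.
    scored = ((i, euclidean(vectors[i], query_vector)) for i in candidates)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy data: four 2-D vectors, each paired with a short DNA-like sequence.
    vectors = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (5.0, 5.0)]
    sequences = ["ACGTAC", "TTGACA", "GGGTAC", "CCCCCC"]
    print(pattern_constrained_knn(vectors, sequences, (0.0, 0.0), "TAC", k=2))
```

A scan like this costs time linear in the dataset size for every query, which is the inefficiency that an index combining pattern filtering with vector search is meant to avoid.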