Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses approximate nearest neighbor search (ANNS) over high-dimensional sparse vector collections. It proposes a retrieval framework that integrates theory-driven sparse vector sketching, geometry-aware inverted indexing, and a local-global information fusion strategy, augmented by an inner-product-preserving dimensionality reduction scheme tailored to sparse embeddings. Experiments on multiple large-scale sparse benchmark datasets show that the method achieves sub-millisecond average query latency on a single CPU while significantly outperforming state-of-the-art approaches in Recall@k. Notably, this work is the first to integrate sketching theory with geometric modeling of inverted indices for sparse ANNS, offering both theoretical guarantees and practical efficiency.

📝 Abstract
Sparse embeddings of data form an attractive class due to their inherent interpretability: every dimension is tied to a term in some vocabulary, making it easy to visually decipher the latent space. Sparsity, however, poses unique challenges for Approximate Nearest Neighbor Search (ANNS), which finds, from a collection of vectors, the k vectors closest to a query. To encourage research on this underexplored topic, sparse ANNS featured prominently in a BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on large benchmark datasets by throughput and accuracy. In this work, we introduce a set of novel data structures and algorithmic methods, a combination of which leads to an elegant, effective, and highly efficient solution to sparse ANNS. Our contributions include a theoretically grounded sketching algorithm that reduces the effective dimensionality of sparse vectors while preserving the ranks induced by inner products; a geometric organization of the inverted index; and a blending of local and global information that improves the efficiency and efficacy of ANNS. Empirically, our final algorithm, dubbed Seismic, reaches sub-millisecond per-query latency with high accuracy on a large-scale benchmark dataset using a single CPU.
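The abstract's sketching idea, reducing the effective dimensionality of sparse vectors while approximately preserving inner products, can be illustrated with a classical CountSketch-style random projection. This is a generic sketch of the underlying principle under that assumption, not the paper's actual algorithm; the function names and parameters below are illustrative:

```python
import random

def countsketch(vec, d, seed):
    """Project a sparse vector (dict: index -> value) into d buckets,
    hashing each nonzero coordinate to a random bucket with a random
    sign. Inner products are preserved in expectation:
    E[<sk(x), sk(y)>] = <x, y>."""
    sk = [0.0] * d
    for i, v in vec.items():
        # Derive a per-coordinate hash from the shared seed so that the
        # same coordinate maps identically across different vectors.
        r = random.Random(seed * 1_000_003 + i)
        bucket = r.randrange(d)
        sign = 1.0 if r.random() < 0.5 else -1.0
        sk[bucket] += sign * v
    return sk

def sparse_dot(x, y):
    """Exact inner product of two sparse vectors."""
    return sum(v * y.get(i, 0.0) for i, v in x.items())

x = {0: 1.0, 3: 2.0, 7: 3.0}
y = {0: 2.0, 3: 1.0, 9: 5.0}
exact = sparse_dot(x, y)  # 4.0
# Average over independent sketches; the estimator is unbiased,
# so the mean concentrates around the exact inner product.
est = sum(
    sum(a * b for a, b in zip(countsketch(x, 32, s), countsketch(y, 32, s)))
    for s in range(500)
) / 500
```

Averaging over independent sketches is only for demonstration here; an actual system would typically use a single, appropriately sized sketch per vector.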
Problem

Research questions and friction points this paper is trying to address.

Developing efficient sketching algorithms for sparse vector dimensionality reduction
Designing geometric indexing structures for sparse approximate nearest neighbor search
Creating hybrid methods combining local and global information for ANNS optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sketching algorithm reduces sparse vector dimensionality
Geometric organization optimizes inverted index structure
Blending local and global information enhances search efficiency
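One way the geometric organization and pruning ideas above can play out in an inverted index is to split each posting list into blocks and keep a cheap per-block summary, here the block's maximum value, that upper-bounds the block's possible contribution to a query. The structure below is a simplified illustration under that assumption, not Seismic's actual index layout, and it assumes nonnegative vector values as is typical for learned sparse embeddings:

```python
from collections import defaultdict

def build_index(docs, block_size=2):
    """Blocked inverted index: each term's postings are sorted by value
    and split into fixed-size blocks; each block stores its maximum
    value as a cheap upper-bound summary."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term, val in doc.items():
            postings[term].append((doc_id, val))
    index = defaultdict(list)  # term -> [(block_max, [(doc_id, val), ...]), ...]
    for term, plist in postings.items():
        plist.sort(key=lambda p: -p[1])  # value-ordered for better pruning
        for i in range(0, len(plist), block_size):
            block = plist[i:i + block_size]
            index[term].append((max(v for _, v in block), block))
    return index

def search(index, query, k=2, threshold=0.0):
    """Score documents term-at-a-time, skipping blocks whose best
    possible contribution (q_t * block_max) falls below the threshold."""
    scores = defaultdict(float)
    for term, q_val in query.items():
        for block_max, block in index.get(term, []):
            if q_val * block_max <= threshold:
                break  # blocks are value-ordered, so later blocks are smaller
            for doc_id, val in block:
                scores[doc_id] += q_val * val
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

docs = [{"a": 1.0, "b": 2.0}, {"a": 3.0}, {"b": 1.0, "c": 4.0}]
index = build_index(docs)
top = search(index, {"a": 1.0, "b": 1.0}, k=2)
```

With `threshold=0.0` the search is exact; raising it trades a little recall for fewer block visits, which is the basic accuracy-throughput dial in this style of index.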