🤖 AI Summary
Approximate k-nearest neighbor (k-ANN) search in vector databases often returns redundant, insufficiently diverse results and offers users little control. Method: This paper proposes a progressive diversity optimization framework that incurs no additional indexing overhead. It explicitly incorporates diversity constraints into state-of-the-art k-ANN pipelines via a three-stage mechanism (iterative search, dynamic deduplication, and similarity verification), enabling joint optimization of result size and diversity in a single retrieval pass. Users can flexibly specify both the desired result count and the diversity strength. Results: Experiments on million-scale benchmarks (LAION-art, Deep1M, Txt2img) demonstrate that, under medium-to-high diversity settings, the method significantly improves recall and mean similarity while incurring less than 5% latency overhead, closely approaching theoretical optimality. It is the first approach to unify accuracy, efficiency, and controllability in k-ANN search.
📝 Abstract
Approximate $k$-nearest neighbor search (A$k$-NNS) is a core operation in vector databases, underpinning applications such as retrieval-augmented generation (RAG) and image retrieval. In these scenarios, users often prefer diverse result sets to minimize redundancy and enhance information value. However, existing greedy-based diversification methods frequently yield sub-optimal results, failing to adequately approximate the optimal similarity score under a given diversification level. Furthermore, there is a need for flexible algorithms that can adapt to varying user-defined result sizes and diversity requirements.
To address these challenges, we propose a novel approach that seamlessly integrates result diversification into state-of-the-art (SOTA) A$k$-NNS methods. Our approach introduces a progressive search framework, consisting of iterative searching, diversification, and verification phases. Carefully designed diversification and verification steps enable our approach to efficiently approximate the optimal diverse result set according to user-specified diversification levels without additional indexing overhead.
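The search/diversify/verify loop described above can be sketched as a greedy filter over a candidate stream returned by an ANN index. This is a minimal illustration, not the paper's actual algorithm or API: the function name `diverse_knn`, the `(id, vector)` candidate format, and the use of a single cosine-similarity threshold as the diversification level are all assumptions for the sake of the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diverse_knn(candidates, k, div_threshold):
    """Greedily select up to k results whose pairwise cosine similarity
    stays below div_threshold.

    `candidates` is a list of (id, vector) pairs, assumed sorted by
    decreasing similarity to the query, as an ANN index would return
    them. (Illustrative sketch only; the paper's method is a progressive
    framework with additional verification logic.)"""
    selected = []
    for cid, vec in candidates:
        # Diversification: skip candidates too close to any kept result.
        if any(cosine(vec, svec) > div_threshold for _, svec in selected):
            continue
        selected.append((cid, vec))
        if len(selected) == k:  # verification: desired result size reached
            break
    return [cid for cid, _ in selected]
```

In a progressive setting, this loop would be re-invoked with a widened candidate pool whenever fewer than $k$ diverse results survive the filter, which is what lets result size and diversity be satisfied in one retrieval pass.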
We evaluate our method on three million-scale benchmark datasets, LAION-art, Deep1M, and Txt2img, using latency, similarity, and recall as performance metrics across a range of $k$ values and diversification thresholds. Experimental results demonstrate that our approach consistently retrieves near-optimal diverse results with minimal latency overhead, particularly under medium and high diversity settings.