🤖 AI Summary
Approximate k-nearest neighbor (k-ANN) search in vector databases often returns redundant, insufficiently diverse results and offers users little control. Method: This paper proposes a progressive diversity optimization framework that incurs no additional indexing overhead. It explicitly incorporates diversity constraints into state-of-the-art k-ANN pipelines via a three-stage mechanism (iterative search, dynamic deduplication, and similarity verification), enabling joint optimization of result size and diversity in a single retrieval pass. Users can flexibly specify both the desired result count and the diversity strength. Results: Experiments on million-scale benchmarks (LAION-art, Deep1M, Txt2img) demonstrate that, under medium-to-high diversity settings, the method significantly improves recall and mean similarity while incurring less than 5% latency overhead, closely approaching theoretical optimality. It is the first approach to unify accuracy, efficiency, and controllability in k-ANN search.
📝 Abstract
Approximate $k$-nearest neighbor search (A$k$-NNS) is a core operation in vector databases, underpinning applications such as retrieval-augmented generation (RAG) and image retrieval. In these scenarios, users often prefer diverse result sets to minimize redundancy and enhance information value. However, existing greedy-based diversification methods frequently yield sub-optimal results, failing to adequately approximate the optimal similarity score under a given diversification level. Furthermore, there is a need for flexible algorithms that can adapt to varying user-defined result sizes and diversity requirements.
To address these challenges, we propose a novel approach that seamlessly integrates result diversification into state-of-the-art (SOTA) A$k$-NNS methods. Our approach introduces a progressive search framework, consisting of iterative searching, diversification, and verification phases. Carefully designed diversification and verification steps enable our approach to efficiently approximate the optimal diverse result set according to user-specified diversification levels without additional indexing overhead.
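The search/diversify/verify loop described above can be sketched as a greedy filter over a candidate stream returned by an ANN index. This is a minimal illustration, not the paper's actual algorithm or API: the function name `diverse_knn`, the `(id, vector)` candidate format, and the use of a single cosine-similarity threshold as the diversification level are all assumptions for the sake of the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diverse_knn(candidates, k, div_threshold):
    """Greedily select up to k results whose pairwise cosine similarity
    stays below div_threshold.

    `candidates` is a list of (id, vector) pairs, assumed sorted by
    decreasing similarity to the query, as an ANN index would return
    them. (Illustrative sketch only; the paper's method is a progressive
    framework with additional verification logic.)"""
    selected = []
    for cid, vec in candidates:
        # Diversification: skip candidates too close to any kept result.
        if any(cosine(vec, svec) > div_threshold for _, svec in selected):
            continue
        selected.append((cid, vec))
        if len(selected) == k:  # verification: desired result size reached
            break
    return [cid for cid, _ in selected]
```

In a progressive setting, this loop would be re-invoked with a widened candidate pool whenever fewer than $k$ diverse results survive the filter, which is what lets result size and diversity be satisfied in one retrieval pass.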
We evaluate our method on three million-scale benchmark datasets, LAION-art, Deep1M, and Txt2img, using latency, similarity, and recall as performance metrics across a range of $k$ values and diversification thresholds. Experimental results demonstrate that our approach consistently retrieves near-optimal diverse results with minimal latency overhead, particularly under medium and high diversity settings.