Sum Estimation via Vector Similarity Search

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently estimating aggregate quantities—such as kernel density estimates or softmax normalization constants—over large-scale datasets. The authors propose a novel algorithm grounded in vector similarity search, which leverages a hierarchical random assignment scheme with exponentially decaying probabilities, combined with multi-layer approximate nearest neighbor structures and top-k retrieval. This approach achieves unbiased estimation by accessing only the $O(\log n)$ most similar vectors and provides high-probability relative error guarantees. Compared to existing reductions that require retrieving the $O(\sqrt{n})$ most similar vectors, the proposed technique substantially reduces computational overhead. Experimental results on the OpenImages and Amazon Reviews datasets demonstrate consistently lower estimation errors and superior runtime efficiency across tasks including density estimation, softmax denominator computation, and spherical range counting.

📝 Abstract
Semantic embeddings to represent objects such as images, text, and audio are widely used in machine learning and have spurred the development of vector similarity search methods for retrieving semantically related objects. In this work, we study the sibling task of estimating a sum over all objects in a set, such as the kernel density estimate (KDE) and the normalizing constant for softmax distributions. While existing solutions provably reduce the sum estimation task to acquiring the $\mathcal{O}(\sqrt{n})$ most similar vectors, where $n$ is the number of objects, we introduce a novel algorithm that only requires the $\mathcal{O}(\log(n))$ most similar vectors. Our approach randomly assigns objects to levels with exponentially-decaying probabilities and constructs a vector similarity search data structure for each level. With the top-$k$ objects from each level, we propose an unbiased estimate of the sum and prove a high-probability relative error bound. We run experiments on OpenImages and Amazon Reviews with a vector similarity search implementation to show that our method can achieve lower error using less computational time than existing reductions. We show results on applications in estimating densities, computing softmax denominators, and counting the number of vectors within a ball.
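To make the two core ingredients from the abstract concrete—the level assignment with exponentially decaying probabilities and the unbiased (inverse-probability) sum estimate—here is a minimal sketch. It is not the paper's full algorithm: the names are hypothetical, the per-level top-$k$ similarity-search step is omitted, and a single fixed level is used purely to illustrate why the level-$\ell$ subsample, reweighted by $2^{\ell}$, gives an unbiased estimate of the full sum.

```python
import numpy as np

def assign_levels(n, rng):
    """Assign each object a level with exponentially decaying
    survival probabilities: P(level_i >= l) = 2**(-l)."""
    # Geometric(1/2) minus one gives levels 0, 1, 2, ... with the
    # desired tail probabilities.
    return rng.geometric(p=0.5, size=n) - 1

def level_estimate(weights, levels, l):
    """Inverse-probability (Horvitz-Thompson style) estimate of
    sum(weights) from the level-l subsample {i : level_i >= l}.
    Each surviving item is upweighted by 1 / P(level_i >= l) = 2**l,
    so the estimate is unbiased for the full sum."""
    mask = levels >= l
    return weights[mask].sum() * 2.0 ** l

rng = np.random.default_rng(0)
n = 100_000
w = rng.random(n)            # stand-in for f(sim(query, x_i)) terms
true_sum = w.sum()

# Average over independent level assignments; each run touches only
# ~n / 2**3 of the objects at level 3, yet the mean concentrates
# near the true sum.
ests = [level_estimate(w, assign_levels(n, rng), l=3) for _ in range(200)]
rel_err = abs(np.mean(ests) / true_sum - 1)
print(rel_err)  # small relative error
```

The paper's contribution goes further: rather than summing an entire level's subsample, it retrieves only the top-$k$ most similar vectors from a similarity-search structure built per level, which is what brings the query cost down to $\mathcal{O}(\log n)$ retrievals.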
Problem

Research questions and friction points this paper is trying to address.

sum estimation
vector similarity search
kernel density estimation
softmax normalization
semantic embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

vector similarity search
sum estimation
kernel density estimation
softmax normalization
logarithmic complexity
Stephen Mussmann
Assistant Professor, Computer Science, Georgia Institute of Technology
active learning, experiment design, data-centric machine learning
Mehul Smriti Raje
Coactive AI
Kavya Tumkur
Coactive AI
Oumayma Messoussi
Desjardins
Cyprien Hachem
Coactive AI
Seby Jacob
Coactive AI