🤖 AI Summary
This work addresses approximate nearest neighbor (ANN) search with binary locality-sensitive hashing (LSH). The authors propose dynamic query modification, which transforms the original query q at query time into a new point c. Their theoretical and experimental analysis indicates that c collides with true near neighbors with higher probability than q, and that c is unlikely to miss every near neighbor, a property that does not hold for q. Building on this mechanism, they define MQ-Forest, a modified version of RP-Forest that estimates c dynamically during the query process. Experiments on several large, high-dimensional benchmark datasets show that MQ-Forest reduces both build and query times by up to 40%. The abstract presents dynamic query modification as a new technique for binary LSH-based ANN search.
📝 Abstract
Our context of interest is how binary locality sensitive hash (LSH) functions can be used to solve the approximate near neighbour (ANN) problem, which seeks to find the k closest elements of some dataset X to some further point q presented as a query. A binary locality sensitive function family H is a set of functions, each of which accepts a point and returns a binary value. A function is locality sensitive if and only if its output is more likely to be equal (a 'hash collision') when two close vectors are used as input than when two far vectors are used. A data structure can be built by generating a binary hash code for each member of X, by drawing one or more functions from H and applying them. When q is presented as a query, the same set of functions is applied to it and those elements of X with equal binary hash codes are retrieved. In this paper we introduce dynamic query modification. This process changes q at query time to form a new value c, which by theoretical and experimental analysis we prove has two significant advantages. Firstly, the hash output of c collides with near neighbours with a greater probability than that of q. Secondly, we show there is little chance of c failing to collide with any near neighbours, a property which we demonstrate is not true for q. To demonstrate the efficacy of the technique, we define a novel structure, MQ-Forest, a modified version of RP-Forest. Both are binary LSH-based ANN mechanisms, but MQ-Forest dynamically estimates a value for c during the query process. We show that MQ-Forest reduces both build and query times by up to 40% when measured over several large, high-dimensional benchmark datasets.
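The build-and-retrieve procedure described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' MQ-Forest or RP-Forest: it assumes random-hyperplane hashing (one bit per hyperplane, locality sensitive with respect to angular distance) as the family H, and all function names (`draw_hash_functions`, `hash_code`, `build_index`, `query`) are hypothetical. Dynamic query modification is not implemented, since the abstract does not specify how c is computed.

```python
import random
from collections import defaultdict

def draw_hash_functions(dim, n_bits, rng):
    """Draw n_bits random hyperplanes; each one defines a binary hash function.
    Assumption: random-hyperplane hashing stands in for the family H."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(point, hyperplanes):
    """Binary code: one bit per hyperplane, set when the point is on its non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(h, point)) >= 0.0) for h in hyperplanes
    )

def build_index(X, hyperplanes):
    """Map each binary code to the indices of the points in X that produce it."""
    buckets = defaultdict(list)
    for i, x in enumerate(X):
        buckets[hash_code(x, hyperplanes)].append(i)
    return buckets

def query(q, hyperplanes, buckets):
    """Return indices of points whose code equals the code of q (hash collisions)."""
    return buckets.get(hash_code(q, hyperplanes), [])

rng = random.Random(0)
dim, n_bits = 8, 4
X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
H = draw_hash_functions(dim, n_bits, rng)
index = build_index(X, H)

# Querying with a dataset member: it always collides with itself.
candidates = query(X[0], H, index)
```

In a full ANN search the retrieved candidates would then be ranked by true distance to q to select the k closest. In this plain scheme a query whose code matches no stored point returns an empty list, which is the kind of failure the abstract says is unlikely for the modified query c.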