Performance bounds for nearest neighbor search with k-d trees

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work explains why $k$-d trees degrade in high-dimensional nearest neighbor search, focusing on where defeatist and comprehensive search stop being effective. Under mild distributional assumptions, the paper proves non-asymptotic bounds on the accuracy of defeatist search and the runtime of comprehensive search, relating dimension, dataset size, accuracy, and runtime. The main results show that when the dimension is at least polylogarithmic in the number of data points, defeatist search is no more likely to return the nearest neighbor than random guessing, while comprehensive search visits every cell with high probability. For uniform data, the paper also shows that, with high probability, comprehensive search visits at most $2^{\mathcal{O}(d)}$ cells when each cell contains at least logarithmically many data points, and defeatist search returns the nearest neighbor when each cell additionally contains at least $2^{\mathcal{O}(d \log d)}$ data points. For arbitrary absolutely continuous distributions, it upper bounds the expected distance between the query and the point returned by defeatist search.

📝 Abstract

The $k$-d tree is one of the oldest and most widely used data structures for nearest neighbor search. It partitions Euclidean space into axis-aligned rectangular cells. There are two standard ways to find the nearest neighbor to a query in a $k$-d tree. Defeatist search returns the closest data point in the query's cell, while comprehensive search also searches other cells as needed to guarantee it finds the nearest neighbor. Both strategies are commonly believed to perform poorly in high dimensions, but there have been few theoretical results explaining this. We prove non-asymptotic bounds on the runtime of comprehensive search and the accuracy of defeatist search. Under mild distributional assumptions, when the dimension $d$ is at least polylogarithmic in the number of data points, defeatist search is no more likely to return the nearest neighbor than random guessing, and comprehensive search visits every cell with high probability. We also show that on uniform data, with high probability, comprehensive search visits at most $2^{\mathcal{O}(d)}$ cells when each cell contains at least logarithmically many data points, and defeatist search returns the nearest neighbor when each cell additionally contains at least $2^{\mathcal{O}(d \log d)}$ data points. Finally, for arbitrary absolutely continuous distributions, we upper bound the expected distance between the query and the point returned by defeatist search.
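The two search strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's construction: the median split that cycles through the axes, the `leaf_size` parameter, and the names `build`, `defeatist`, and `comprehensive` are all assumptions made for illustration. Defeatist search descends to the query's cell and returns the closest point there; comprehensive search also visits any other cell that could still hold a closer point, and counts how many cells it visits.

```python
import random


class Node:
    """A k-d tree node: a leaf holds points, an internal node holds a split."""

    def __init__(self, points=None, axis=None, split=None, left=None, right=None):
        self.points = points  # list of points for a leaf, None otherwise
        self.axis = axis
        self.split = split
        self.left = left
        self.right = right


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points, leaf_size=4, depth=0):
    # Assumption: split at the median, cycling through the coordinate axes.
    if len(points) <= leaf_size:
        return Node(points=points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(axis=axis, split=pts[mid][axis],
                left=build(pts[:mid], leaf_size, depth + 1),
                right=build(pts[mid:], leaf_size, depth + 1))


def defeatist(root, q):
    """Return the closest data point inside the query's own cell only."""
    node = root
    while node.points is None:
        node = node.left if q[node.axis] < node.split else node.right
    return min(node.points, key=lambda p: sq_dist(p, q))


def comprehensive(root, q):
    """Return (exact nearest neighbor, number of leaf cells visited)."""
    best, best_d, visited = None, float("inf"), 0

    def visit(node):
        nonlocal best, best_d, visited
        if node.points is not None:
            visited += 1
            for p in node.points:
                d = sq_dist(p, q)
                if d < best_d:
                    best, best_d = p, d
            return
        diff = q[node.axis] - node.split
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # The far side can only help if the splitting plane is within the
        # current best distance; otherwise it is pruned.
        if diff * diff <= best_d:
            visit(far)

    visit(root)
    return best, visited


if __name__ == "__main__":
    random.seed(0)
    d, n = 8, 1000  # illustrative values, not taken from the paper
    data = [tuple(random.random() for _ in range(d)) for _ in range(n)]
    tree = build(data)
    query = tuple(random.random() for _ in range(d))
    exact, cells = comprehensive(tree, query)
    print("defeatist found the nearest neighbor:", defeatist(tree, query) == exact)
    print("cells visited by comprehensive search:", cells)
```

Raising `d` while keeping `n` fixed gives an informal way to watch the behavior the abstract describes: the count of visited cells tends to grow toward the total number of cells, and the defeatist answer matches the exact one less often. The pruning test here uses only the distance to the splitting plane, which is a conservative simplification of a full cell-distance check.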

Problem

Research questions and friction points this paper is trying to address.

nearest neighbor search

k-d trees

high-dimensional data

search accuracy

computational complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

k-d tree

nearest neighbor search

high-dimensional analysis