Distributed-Memory Parallel Algorithms for Fixed-Radius Near Neighbor Graph Construction

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scalability limitations of fixed-radius nearest-neighbor graph (FRNNG) construction for large-scale scientific data under non-Euclidean metrics and high-dimensional settings, this paper proposes the first sparsity-aware, distributed-memory parallel algorithm designed for general metric spaces. Our method models neighborhood structure via cover trees and integrates dual partitioning strategies—point-based and space-based—while leveraging shared-memory optimizations for efficient intra-node parallelism. It supports exact nearest-neighbor search without assuming Euclidean geometry, significantly improving efficiency on high-dimensional non-Euclidean data. On million-point high-dimensional datasets, the algorithm achieves a 678.34× speedup using 1,024 cores (average degree 70) and 1,590.99× speedup using 4,096 cores (average degree 500), substantially outperforming state-of-the-art approaches.

Technology Category

Application Category

📝 Abstract
Computing fixed-radius near-neighbor graphs is an important first step for many data analysis algorithms. Near-neighbor graphs connect points that are close under some metric, endowing point clouds with a combinatorial structure. As computing power and data acquisition methods advance, diverse sources of large scientific datasets would greatly benefit from scalable solutions to this common subroutine for downstream analysis. Prior work on parallel nearest neighbors has made great progress in problems like k-nearest and approximate nearest neighbor search problems, with particular attention on Euclidean spaces. Yet many applications need exact solutions and non-Euclidean metrics. This paper presents a scalable sparsity-aware distributed memory algorithm using cover trees to compute near-neighbor graphs in general metric spaces. We provide a shared-memory algorithm for cover tree construction and demonstrate its competitiveness with state-of-the-art fixed-radius search data structures. We then introduce two distributed-memory algorithms for the near-neighbor graph problem, a simple point-partitioning strategy and a spatial-partitioning strategy, which leverage the cover tree algorithm on each node. Our algorithms exhibit parallel scaling across a variety of real and synthetic datasets for both traditional and non-traditional metrics. On real world high dimensional datasets with one million points, we achieve speedups up to 678.34x over the state-of-the-art using 1024 cores for graphs with 70 neighbors per vertex (on average), and up to 1590.99x using 4096 cores for graphs with 500 neighbors per vertex (on average).
Problem

Research questions and friction points this paper is trying to address.

Develop scalable distributed algorithms for exact fixed-radius near-neighbor graphs
Handle non-Euclidean metrics and general metric spaces efficiently
Achieve high parallel speedups for large high-dimensional datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed memory algorithm using cover trees
Shared memory algorithm for cover tree construction
Two partitioning strategies for distributed near-neighbor graphs
🔎 Similar Papers
No similar papers found.