🤖 AI Summary
This work addresses the irregular computation and scalability bottlenecks that arise when constructing k-nearest neighbor graphs on distributed GPU systems. The authors propose a scalable distributed framework that first builds local neighbor graphs from data shards on individual GPUs and then uses MPI one-sided communication to fetch remote graphs for global approximate nearest neighbor refinement. Notably, they introduce the first lock-free, single-GPU graph construction algorithm tailored for AMD GPUs. Their single-GPU implementation outperforms a state-of-the-art GPU-based implementation on a single MI300A APU, and the distributed framework achieves strong-scaling speedups of 11× (32 to 512 APUs) on 1 billion points and 6.9× (64 to 512 APUs) on 2 billion points.
📝 Abstract
Neighbor graphs capture relationships among data points and are widely used in data analytics and AI workloads. Many studies have explored approximate construction methods for single-node systems, including GPUs. However, extending these methods to distributed systems, both to handle larger datasets and to accelerate construction further, remains challenging because of irregular computation patterns.
We present SOLANET, a GPU-accelerated distributed neighbor graph construction toolkit. SOLANET first constructs local graphs on each GPU after data partitioning and then refines them via approximate nearest neighbor (ANN) searches over remote graphs pulled from other GPUs using MPI one-sided operations. SOLANET also provides a lock-free single-GPU neighbor graph construction algorithm for AMD GPUs.
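The two-phase scheme described above (local construction, then refinement against remote shards) can be illustrated with a small single-process sketch. This is not SOLANET's implementation: the function names are hypothetical, the shards live in one Python process rather than on separate GPUs, remote data is read directly instead of being pulled with an MPI one-sided call such as `MPI_Get`, and the refinement step scans every remote point by brute force where SOLANET performs an approximate search over the remote graph.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_local_graph(points, ids, k):
    """Phase 1: brute-force k-NN graph restricted to one shard.

    Returns {point_id: [(distance, neighbor_id), ...]} sorted by distance.
    """
    graph = {}
    for i in ids:
        cands = [(sq_dist(points[i], points[j]), j) for j in ids if j != i]
        graph[i] = heapq.nsmallest(k, cands)
    return graph


def refine_with_remote(points, graph, remote_ids, k):
    """Phase 2: merge candidates from one remote shard into local lists.

    Assumption: an exhaustive scan stands in for the ANN search, and a
    direct list read stands in for the one-sided remote fetch.
    """
    for i, nbrs in graph.items():
        cands = nbrs + [(sq_dist(points[i], points[j]), j) for j in remote_ids]
        graph[i] = heapq.nsmallest(k, cands)


def distributed_knn(points, num_shards, k):
    """Partition points round-robin, build local graphs, then refine."""
    n = len(points)
    shards = [list(range(s, n, num_shards)) for s in range(num_shards)]
    graphs = [build_local_graph(points, ids, k) for ids in shards]
    for s, graph in enumerate(graphs):
        for r, remote_ids in enumerate(shards):
            if r != s:
                refine_with_remote(points, graph, remote_ids, k)
    merged = {}
    for graph in graphs:
        merged.update(graph)
    return merged


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(40)]
    knn = distributed_knn(pts, num_shards=4, k=3)
    print(knn[0])
```

Because the refinement here is exhaustive, the merged result equals the exact k-NN graph. A real system would trade some of that accuracy for speed by searching only a small part of each remote graph.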
Our single-GPU implementation outperforms a state-of-the-art GPU-based approximate neighbor graph construction implementation across multiple datasets on a single MI300A APU. Furthermore, SOLANET achieves an 11× speedup when scaling from 32 to 512 APUs on 1 billion data points and a 6.9× speedup when scaling from 64 to 512 APUs on 2 billion points.
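To put the reported speedups in context, they can be compared with the ideal linear speedup for the same increase in APU count. The sketch below computes the resulting strong-scaling efficiency. The efficiency figures are derived here from the numbers in the abstract and are not reported by the authors, and the helper name is hypothetical.

```python
def strong_scaling_efficiency(speedup, base_units, scaled_units):
    """Ratio of measured speedup to the ideal (linear) speedup."""
    ideal = scaled_units / base_units
    return speedup / ideal


# 1 billion points: 32 -> 512 APUs is a 16x increase in resources.
eff_1b = strong_scaling_efficiency(11.0, 32, 512)

# 2 billion points: 64 -> 512 APUs is an 8x increase in resources.
eff_2b = strong_scaling_efficiency(6.9, 64, 512)

print(f"1B points: {eff_1b:.1%} of ideal")
print(f"2B points: {eff_2b:.1%} of ideal")
```

Under these assumptions the efficiency is 11/16, roughly 69%, for the 1-billion-point run and 6.9/8, roughly 86%, for the 2-billion-point run. The two figures use different baselines (32 versus 64 APUs), so they are not directly comparable.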