JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA

📅 2026-04-07
🤖 AI Summary
This work addresses the inefficiency of traditional space-tree algorithms on GPUs, which suffer from thread divergence and irregular memory access that hinder parallel performance. To overcome these limitations, the authors propose a GPU-optimized Morton-ordered flat tree structure that leverages a flattened memory layout and z-order encoding to substantially improve memory coalescing and thread cooperation during dual-tree traversal. Implemented in JAX and CUDA, the approach enables highly efficient k-nearest neighbor search and Friends-of-Friends clustering, achieving over an order-of-magnitude speedup compared to existing GPU libraries on datasets with $N \gtrsim 10^7$ points, while also supporting strong multi-GPU distributed scaling.
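As a rough illustration of the z-order encoding mentioned above, the following sketch interleaves the bits of quantized 3D coordinates into a Morton code and sorts points along the resulting curve. It is a plain-Python illustration and not the JZ-Tree implementation; the function names `morton3d` and `morton_order`, the 10-bit resolution, and the assumption that points lie in the unit cube are all hypothetical choices made for this example.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three integer cell indices.

    Bit b of ix lands at position 3*b, of iy at 3*b + 1, of iz at 3*b + 2.
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def morton_order(points, bits=10):
    """Return indices that sort points in [0, 1)^3 along the z-order curve.

    Assumption: coordinates are already scaled to the unit cube.
    """
    n_cells = 1 << bits

    def key(i):
        # Quantize each coordinate to an integer cell index, clamped to the grid.
        cells = [min(int(c * n_cells), n_cells - 1) for c in points[i]]
        return morton3d(*cells, bits=bits)

    return sorted(range(len(points)), key=key)


points = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1)]
order = morton_order(points)
```

Points that are close in space tend to end up close in the sorted order, which is the property a flat, array-based tree layout can exploit for contiguous memory access.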
📝 Abstract
Algorithms based on spatial tree traversal are widely regarded as among the most efficient and flexible approaches for many problems in CPU-based high-performance computing (HPC). However, directly transferring these algorithms to GPU architectures often yields substantially smaller performance gains than expected in light of the high computational throughput of modern GPUs. The branching nature of tree algorithms leads to thread divergence and irregular memory access patterns -- both of which may severely limit GPU performance. To address these challenges, we propose a Morton (z-order) 'plane-based tree hierarchy' that is specifically designed for GPU architectures. The resulting flattened data layout enables efficient dual-tree traversal with collaborative execution across thread groups, leading to highly coalesced memory access patterns. Based on this framework we present implementations of two important spatial algorithms -- exact $k$-nearest neighbour search and friends-of-friends (FoF) clustering. For both cases, we observe more than an order-of-magnitude performance improvement over the closest competing GPU libraries for large problem sizes ($N \gtrsim 10^7$), together with strong scaling to distributed multi-GPU systems. We provide an open-source implementation, 'JZ-Tree' (JAX z-order tree), which serves as a foundation for efficient GPU implementations of a broad class of tree-based algorithms.
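To make the friends-of-friends (FoF) task concrete, here is a brute-force reference sketch: two points are linked when their separation is at most a linking length, and groups are the connected components of those links. This is an O(N²) union-find illustration for small inputs only, not the dual-tree traversal the abstract describes; the function name `friends_of_friends` and the parameter `linking_length` are hypothetical names chosen for this example.

```python
def friends_of_friends(points, linking_length):
    """Label each point with the smallest index in its FoF group.

    Brute-force reference: checks every pair, so only suitable for small N.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    max_dist_sq = linking_length ** 2
    for i in range(n):
        for j in range(i + 1, n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if dist_sq <= max_dist_sq:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller one so that each
                    # root is always the smallest index in its group.
                    parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(n)]


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
labels = friends_of_friends(points, linking_length=0.6)
```

In this example the first three points form a chain: the end points are farther apart than the linking length, but both are linked to the middle point, so all three share one group while the distant fourth point stays alone. A tree-based method produces the same grouping while avoiding the all-pairs distance check.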
Problem

Research questions and friction points this paper is trying to address.

GPU acceleration
spatial tree traversal
thread divergence
memory access patterns
nearest neighbour search
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-friendly tree traversal
Morton ordering
dual-tree walk
coalesced memory access
JAX CUDA implementation
Jens Stücker
University of Vienna, Department of Astrophysics, Türkenschanzstraße 18, 1180 Vienna, Austria
Oliver Hahn
PhD Student at TU Darmstadt
Computer Vision, Machine Learning
Lukas Winkler
University of Vienna, Department of Astrophysics, Türkenschanzstraße 18, 1180 Vienna, Austria
Adrian Gutierrez Adame
University of Vienna, Department of Astrophysics, Türkenschanzstraße 18, 1180 Vienna, Austria
Thomas Flöss
University of Vienna, Department of Astrophysics, Türkenschanzstraße 18, 1180 Vienna, Austria; and University of Vienna, Department of Mathematics, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria