🤖 AI Summary
To address the real-time performance bottleneck of sampling-based motion planners (SBMPs) under complex, high-dimensional kinodynamic constraints, a bottleneck rooted in their inherently sequential design, this paper introduces Kino-PAX, a highly parallel, GPU-accelerated kinodynamic sampling-based planner. Methodologically, Kino-PAX decomposes the iterative tree-growth process into three massively parallel subroutines, growing a tree of trajectory segments directly in parallel, and is proven probabilistically complete. Its design is aligned with the execution hierarchy of parallel devices: threads remain largely independent with balanced workloads, exploiting low-latency resources while minimizing high-latency data transfers and synchronization. Evaluations report solutions on the order of 10 ms on a desktop GPU and roughly 100 ms on an embedded GPU, up to a 1000× improvement over coarse-grained CPU parallelization of state-of-the-art sequential kinodynamic planners.
📝 Abstract
Sampling-based motion planners (SBMPs) are effective for planning with complex kinodynamic constraints in high-dimensional spaces, but they still struggle to achieve <italic>real-time</italic> performance, mainly due to their serial computation design. We present <italic>Kinodynamic Parallel Accelerated eXpansion</italic> (<italic>Kino-PAX</italic>), a novel highly parallel kinodynamic SBMP designed for parallel devices such as GPUs. <italic>Kino-PAX</italic> grows a tree of trajectory segments directly in parallel. Our key insight is how to decompose the iterative tree growth process into three massively parallel subroutines. <italic>Kino-PAX</italic> is designed to align with the parallel device execution hierarchies, by ensuring that threads are largely independent, share equal workloads, and take advantage of low-latency resources while minimizing high-latency data transfers and process synchronization. This design results in a very efficient GPU implementation. We prove that <italic>Kino-PAX</italic> is probabilistically complete and analyze its scalability with compute hardware improvements. Empirical evaluations demonstrate solutions on the order of 10 ms on a desktop GPU and on the order of 100 ms on an embedded GPU, representing up to <inline-formula><tex-math notation="LaTeX">$1000\times$</tex-math></inline-formula> improvement compared to coarse-grained CPU parallelization of state-of-the-art sequential algorithms over a range of complex environments and systems.
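The abstract's core idea, replacing the sequential sample-then-extend loop with batched, data-parallel tree growth, can be illustrated with a toy sketch. This is a hypothetical, CPU-side NumPy mock-up of the concept (function names, the three-phase split shown here, the 1-D double-integrator dynamics, and the state-bound "collision" check are all illustrative assumptions, not the paper's actual subroutines or implementation):

```python
import numpy as np

def batched_growth_iteration(nodes, rng, batch=256, dt=0.1):
    """One batched tree-growth iteration in three data-parallel phases.

    Hypothetical sketch: each of the `batch` lanes acts like an independent
    GPU thread expanding the tree concurrently.
    """
    # Phase 1: each lane independently selects a tree node to expand.
    idx = rng.integers(0, len(nodes), size=batch)
    x = nodes[idx]                          # (batch, 2): position, velocity
    # Phase 2: each lane propagates a short trajectory segment under a
    # randomly sampled control (1-D double-integrator dynamics as a stand-in).
    u = rng.uniform(-1.0, 1.0, size=batch)
    nxt = np.empty_like(x)
    nxt[:, 0] = x[:, 0] + x[:, 1] * dt      # position update
    nxt[:, 1] = x[:, 1] + u * dt            # velocity update
    # Phase 3: prune invalid segments and commit survivors to the tree
    # (a simple state bound stands in for collision/constraint checking).
    valid = np.abs(nxt).max(axis=1) < 5.0
    return np.vstack([nodes, nxt[valid]])

rng = np.random.default_rng(0)
tree = np.zeros((1, 2))                     # root at the origin
for _ in range(20):
    tree = batched_growth_iteration(tree, rng)
```

On a GPU, each phase would map to a kernel over thousands of lanes with the tree held in device memory, which is where the independence and load-balancing properties described in the abstract matter.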