FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the latency that Farthest Point Sampling (FPS) adds to large-scale point cloud neural network inference, where it has become a key scalability bottleneck. The study systematically identifies three types of redundancy in FPS: computations over the entire point cloud that are unnecessary, redundant late-stage sampling iterations, and outputs that are predictable across network layers. To remove these inefficiencies, the authors propose FlashFPS, a hardware-agnostic, plug-and-play acceleration framework with two components: FPS-Prune, which applies candidate pruning and iteration pruning, and FPS-Cache, which caches FPS outputs and reuses them across layers. FlashFPS achieves a 5.16× speedup over the standard CUDA baseline on GPU and a 2.69× speedup on PNN accelerators, with negligible accuracy loss.
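For orientation, the baseline operation being accelerated can be sketched as follows. This is the standard textbook form of FPS written in plain Python, not code from the FlashFPS repository; the function name `farthest_point_sampling` and its parameters are illustrative assumptions. The inner loop makes the "full-cloud" cost visible: every point is revisited in every iteration. Production implementations typically run this as a vectorized or CUDA kernel instead.

```python
import math
import random


def farthest_point_sampling(points, num_samples, start_index=0):
    """Baseline FPS: return indices of `num_samples` points in selection order.

    Hypothetical helper for illustration only; not the FlashFPS API.
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be in [1, len(points)]")
    # min_dist[i] = distance from point i to its nearest already-selected point.
    min_dist = [math.inf] * len(points)
    selected = [start_index]
    for _ in range(num_samples - 1):
        last = points[selected[-1]]
        # Full-cloud pass: every point is revisited in every iteration.
        for i, p in enumerate(points):
            d = math.dist(p, last)
            if d < min_dist[i]:
                min_dist[i] = d
        # Pick the point that is farthest from the selected set.
        selected.append(max(range(len(points)), key=min_dist.__getitem__))
    return selected


if __name__ == "__main__":
    random.seed(0)
    cloud = [(random.random(), random.random(), random.random()) for _ in range(1000)]
    print(farthest_point_sampling(cloud, 8))
```

Sampling M points from N costs on the order of N·M distance evaluations in this form, which is why the operation becomes expensive as the point cloud grows.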

📝 Abstract

Point-based Neural Networks (PNNs) have become a key approach for point cloud processing. However, a core operation in these models, Farthest Point Sampling (FPS), often introduces significant inference latency, especially at large scale. Despite existing CUDA- and hardware-level optimizations, FPS remains a major bottleneck because it performs exhaustive computations in multiple network layers of a PNN, which hinders scalability. Through systematic analysis, we identify three substantial redundancies in FPS: unnecessary full-cloud computations, redundant late-stage iterations, and predictable inter-layer outputs that make later FPS computations avoidable. To address these, we propose FlashFPS, a hardware-agnostic, plug-and-play framework for FPS acceleration composed of FPS-Prune and FPS-Cache. FPS-Prune introduces candidate pruning and iteration pruning to reduce redundant computations in FPS while preserving sampling quality, and FPS-Cache eliminates layer-wise redundancy via cache-and-reuse. Integrated into existing CUDA libraries and state-of-the-art PNN accelerators, FlashFPS achieves a 5.16× speedup over the standard CUDA baseline on GPU and 2.69× on PNN accelerators with negligible accuracy loss, enabling efficient and scalable PNN inference. Code is released at https://github.com/Yuzhe-Fu/FlashFPS.
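To make "predictable inter-layer outputs" concrete, the sketch below checks one well-known property of exact FPS. It is an illustration under stated assumptions, not a description of how FPS-Cache is implemented: it assumes the second layer samples from the first layer's FPS output and starts from the same seed point, and the names `fps_order` and `check_prefix_property` are hypothetical. Under those assumptions (and barring distance ties), the second layer's result is just a prefix of the first layer's selection order, so it could be read from a cache instead of recomputed.

```python
import math
import random


def fps_order(points, num_samples, start_index=0):
    """Exact FPS in plain Python; returns indices in selection order."""
    min_dist = [math.inf] * len(points)
    selected = [start_index]
    for _ in range(num_samples - 1):
        last = points[selected[-1]]
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, last))
        selected.append(max(range(len(points)), key=min_dist.__getitem__))
    return selected


def check_prefix_property(n_points=500, n1=64, n2=16, seed=0):
    """Return True if layer-2 FPS equals the first n2 picks of layer-1 FPS."""
    rng = random.Random(seed)
    cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(n_points)]

    # Layer 1: sample n1 points from the full cloud.
    layer1 = fps_order(cloud, n1)

    # Layer 2 (assumption): sample n2 points from layer 1's output,
    # starting from the same seed point (position 0 of the subset).
    subset = [cloud[i] for i in layer1]
    layer2_local = fps_order(subset, n2)
    layer2 = [layer1[j] for j in layer2_local]  # map back to cloud indices

    # Cache-and-reuse view: layer 2 is a slice of the cached layer-1 order.
    return layer2 == layer1[:n2]


if __name__ == "__main__":
    print(check_prefix_property())
```

The property holds because each point chosen on the full cloud is also present in the sampled subset, so it remains the farthest candidate when the search is restricted to that subset. Real networks may use different start points or approximate samplers, in which case the reuse would be approximate rather than exact.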

Problem

Research questions and friction points this paper is trying to address.

Farthest Point Sampling

Point Cloud

Inference Latency

Scalability

Redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Farthest Point Sampling

Point Cloud Acceleration

Redundancy Elimination