🤖 AI Summary
Intelligent point cloud analysis suffers from pipeline stalls and high energy consumption caused by frequent off-chip memory accesses. This paper proposes the first hardware architecture supporting fully-streaming point cloud processing, introducing two novel mechanisms—compulsory splitting and deterministic termination—to eliminate off-chip data movement and enable automatic on-chip buffer sizing. By combining the streaming computation paradigm, improved memory access locality, and domain-specific accelerator design, the proposed framework, StreamGrid, guarantees computational correctness while significantly improving efficiency. Compared to a baseline implementation, it reduces on-chip memory footprint by 61.3% and energy consumption by 40.5%. Against state-of-the-art point cloud accelerators, it achieves a 10.0× speedup and a 3.9× improvement in energy efficiency. The proposed architecture provides a scalable, low-power hardware solution for real-time point cloud processing.
📝 Abstract
Point clouds are increasingly important in intelligent applications, but frequent off-chip memory traffic in accelerators causes pipeline stalls and leads to high energy consumption. While conventional line buffer techniques can eliminate off-chip traffic, they cannot be directly applied to point clouds due to their inherent computation patterns. To address this, we introduce two techniques, compulsory splitting and deterministic termination, enabling fully-streaming processing. We further propose StreamGrid, a framework that integrates these techniques and automatically optimizes on-chip buffer sizes. Our evaluation shows StreamGrid reduces on-chip memory by 61.3% and energy consumption by 40.5% with marginal accuracy loss compared to the baselines without our techniques. Additionally, we achieve a 10.0× speedup and 3.9× higher energy efficiency over state-of-the-art accelerators.