🤖 AI Summary
This work addresses LiDAR-based 3D pedestrian detection under occlusion and in complex scenes by proposing an efficient, real-time, LiDAR-only detection method. The approach encodes the 3D point cloud into a lightweight 2D tensor using a height-aware, three-band bird's-eye-view (BEV) representation and uses a single-stage network to detect cars, pedestrians, and cyclists in one pass. Key components include a hierarchical bidirectional feature pyramid over levels P1–P4, area attention in the deep backbone stages, a rotated IoU loss, and distribution focal learning for box side offsets, which together improve robustness and accuracy under occlusion. On the KITTI dataset, the method reaches pedestrian BEV AP of 58.7, 52.6, and 47.2 on the easy, moderate, and hard subsets, respectively, at 49 FPS on a single consumer GPU, outperforming Complex-YOLO by +12.6%, +7.5%, and +3.1% and indicating good potential for real-world deployment.
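To make the three-band BEV idea concrete, here is a minimal NumPy sketch of one way such an encoding could be built. It is not the authors' implementation: the function name `encode_three_band_bev`, the grid extents, the 0.1 m cell size, the band edges in `z_edges`, and the choice to store the maximum reflectance per cell are all assumptions made for illustration.

```python
import numpy as np


def encode_three_band_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0),
                          z_edges=(-2.0, -1.0, 0.0, 1.0), cell=0.1):
    """Rasterize an (N, 4) array of x, y, z, reflectance into a (3, H, W) BEV tensor.

    All ranges, band edges, and the cell size are hypothetical defaults,
    not values taken from the paper.
    """
    pts = np.asarray(points, dtype=np.float64)
    x, y, z, refl = pts[:, 0], pts[:, 1], pts[:, 2], pts[:, 3]

    height = int(round((x_range[1] - x_range[0]) / cell))
    width = int(round((y_range[1] - y_range[0]) / cell))

    # Keep only points inside the grid and inside the covered height span.
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_edges[0]) & (z < z_edges[-1]))
    x, y, z, refl = x[keep], y[keep], z[keep], refl[keep]

    # Map metric coordinates to integer cell indices (clipped for float safety).
    rows = np.minimum(((x - x_range[0]) / cell).astype(np.int64), height - 1)
    cols = np.minimum(((y - y_range[0]) / cell).astype(np.int64), width - 1)

    # Assign each point to one of the three height bands using the inner edges.
    bands = np.searchsorted(np.asarray(z_edges[1:-1]), z, side="right")

    # Assumption: each channel stores the maximum reflectance seen in the cell.
    bev = np.zeros((3, height, width), dtype=np.float32)
    np.maximum.at(bev, (bands, rows, cols), refl.astype(np.float32))
    return bev
```

With these defaults, a cloud covering 40 m by 40 m becomes a 3 x 400 x 400 tensor that an ordinary 2D detector can consume. Note that a cell whose points all have zero reflectance is indistinguishable from an empty cell in this sketch.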
📝 Abstract
Safe autonomous agents and mobile robots need fast, real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's-eye-view (BEV) encoding that maps the full 3D LiDAR point cloud into a lightweight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at its deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes using distribution focal learning for side offsets and a rotated IoU loss. To resist memorization, training applies a small vertical re-binning and a mild reflectance jitter in channel space. During 3D reconstruction, an interquartile range (IQR) filter removes noisy and outlier LiDAR points. On the KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP (%) on the easy, moderate, and hard subsets at 49 FPS on a single consumer GPU, surpassing Complex-YOLO with gains of +12.6%, +7.5%, and +3.1%, respectively. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.
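The IQR filtering step mentioned above can be illustrated with a short NumPy sketch. This is an assumption-laden example rather than the released code: the helper names `iqr_mask` and `vertical_extent`, the conventional Tukey factor `k = 1.5`, and the choice to filter the z coordinate of the points inside a detected BEV box are all hypothetical.

```python
import numpy as np


def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=np.float64)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)


def vertical_extent(z_in_box, k=1.5):
    """Estimate the bottom and top of a 3D box from the z values of its points.

    Assumption: outliers are rejected along z only, and the box height is
    taken from the minimum and maximum of the surviving points.
    """
    z = np.asarray(z_in_box, dtype=np.float64)
    kept = z[iqr_mask(z, k)]
    return float(kept.min()), float(kept.max())


# A stray return at z = 5.0 m is rejected before the box height is computed.
z_values = [-1.6, -1.5, -1.0, -0.5, 0.0, 0.1, 5.0]
bottom, top = vertical_extent(z_values)
```

Rejecting outliers before taking the extremes keeps a single spurious return from stretching the reconstructed box. The sketch expects at least one point per box; an empty input would need to be handled by the caller.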