🤖 AI Summary
To address the challenge of LiDAR–RGB fusion for 3D object detection in outdoor scenes, this paper proposes an efficient multi-stream fusion network. The method employs three parallel branches to extract features from LiDAR pillar grids, bird's-eye-view (BEV) representations, and UV-mapped RGB projections, respectively. Crucially, it introduces polar-coordinate indexing, a novel mechanism for cross-modal alignment that enables joint modeling of geometric structure, texture details, and spatial layout. The architecture integrates LiDAR-PillarNet, height-compressed encoding, UV projection, polar-coordinate feature indexing, and multi-scale feature fusion, coupled with a two-stage detection head to improve localization accuracy. Evaluated on the KITTI benchmark, the method achieves state-of-the-art (SOTA) or near-SOTA performance across multiple classes, with notable improvements in mean average precision (mAP), while remaining among the most efficient methods. The source code is publicly available.
📝 Abstract
Fusing LiDAR and RGB data can enhance outdoor 3D object detection accuracy, and such multimodal fusion has started gaining traction for real-world detection challenges. However, effectively integrating these modalities for precise 3D object detection remains a largely open problem. To address this, we propose a MultiStream Detection (MuStD) network that meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input, while the LiDAR-Height Compression stream computes Bird's-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Finally, the features containing comprehensive spatial, textural and geometric information are carefully fused and fed to a detection head for 3D object detection. Our extensive evaluation on the challenging KITTI Object Detection Benchmark, using the public testing server at https://www.cvlibs.net/datasets/kitti/eval_object_detail.php?&result=d162ec699d6992040e34314d19ab7f5c217075e0, establishes the efficacy of our method: it achieves new state-of-the-art or highly competitive results in different categories while remaining among the most efficient methods. Our code will be released through the MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.git
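The two cross-modal indexing operations named in the abstract, UV mapping and polar coordinate indexing, can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the MuStD implementation: the 3x4 projection matrix `P` stands in for a KITTI-style camera calibration matrix, and the bin counts, range cap, and axis conventions (forward = z, right = x) are hypothetical choices for illustration.

```python
import numpy as np

def uv_map(points_xyz, P):
    """Project Nx3 points (camera frame) to pixel (u, v) coordinates
    via a 3x4 projection matrix, i.e. the 'UV mapping' step."""
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # Nx4
    proj = homo @ P.T                 # Nx3 homogeneous image coordinates
    return proj[:, :2] / proj[:, 2:3]  # perspective divide -> (u, v)

def polar_index(points_xyz, r_bins=64, a_bins=128, r_max=80.0):
    """Assign each point a (range bin, azimuth bin) index in the ground
    plane, i.e. one plausible form of 'polar coordinate indexing'."""
    r = np.hypot(points_xyz[:, 0], points_xyz[:, 2])        # ground range
    azim = np.arctan2(points_xyz[:, 0], points_xyz[:, 2])   # [-pi, pi]
    r_idx = np.clip((r / r_max * r_bins).astype(int), 0, r_bins - 1)
    a_idx = np.clip(((azim + np.pi) / (2 * np.pi) * a_bins).astype(int),
                    0, a_bins - 1)
    return r_idx, a_idx

# Toy example: one point 10 m ahead and slightly to the right.
P = np.array([[700.0, 0.0, 620.0, 0.0],
              [0.0, 700.0, 190.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
pts = np.array([[1.0, 0.5, 10.0]])
uv = uv_map(pts, P)            # pixel location of the point in the image
r_idx, a_idx = polar_index(pts)  # polar cell used to index fused features
```

In a fusion network of this kind, `uv` would gather RGB texture features for each LiDAR point, while `(r_idx, a_idx)` would scatter the combined features into a polar grid for joint spatial reasoning; the exact tensors and grids used by MuStD may differ.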