🤖 AI Summary
PointPillars achieves high efficiency but suffers from limited accuracy due to geometric information loss in its pseudo-image representation and incompatibility with two-stage 3D detection paradigms. To address this, we propose 3DPillars—the first two-stage 3D object detector built upon pseudo-image representations. Our method introduces a novel CNN architecture that views 3D voxel features as a stack of pseudo images and leverages 2D convolutions for efficient 3D geometric modeling. We further design a separable voxel feature module and a sparse-scene-aware RoI head to enable precise region proposal generation and context-aware feature enhancement. By integrating pseudo-image encoding, multi-scale feature aggregation, and sparse RoI processing, 3DPillars retains PointPillars' real-time inference speed while significantly improving detection accuracy. Extensive experiments on the KITTI and Waymo Open datasets demonstrate that 3DPillars achieves performance competitive with state-of-the-art 3D detectors, striking a favorable balance between speed and accuracy.
📝 Abstract
PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals, which typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gap between PointPillars and state-of-the-art methods while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise between speed and accuracy.
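The core idea — that a 3D voxel feature volume can be treated as a stack of pseudo images processed with ordinary 2D convolutions — can be sketched as follows. The tensor shapes and the toy convolution below are illustrative assumptions, not the paper's actual architecture: the point is only that reshaping a `(C, D, H, W)` volume into `(C·D, H, W)` lets a 2D kernel mix information across depth slices without any 3D convolution.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' 2D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                # Each output pixel sums over every input channel, i.e. over
                # all depth slices at once.
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

# Hypothetical voxel feature volume: C channels, D depth slices, H x W BEV grid.
C, D, H, W = 4, 8, 16, 16
voxels = np.random.randn(C, D, H, W)

# View the volume as a stack of D pseudo images with C*D total channels.
pseudo_stack = voxels.reshape(C * D, H, W)

# A single 2D convolution over the stacked channels aggregates features
# across depth slices using only 2D operations.
kernel = np.random.randn(6, C * D, 3, 3)
features = conv2d(pseudo_stack, kernel)
print(features.shape)  # (6, 14, 14)
```

In practice the paper's separable voxel feature module would realize this far more efficiently than the naive loop above; the sketch only demonstrates why stacking depth slices into the channel dimension makes 2D convolutions sufficient for voxel-based feature learning.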