🤖 AI Summary
PointPillars achieves high efficiency but suffers from limited accuracy due to geometric information loss in its pseudo-image representation and incompatibility with two-stage 3D detection paradigms. To address this, we propose 3DPillars—the first two-stage 3D object detector built upon pseudo-image representations. Our method introduces a novel CNN architecture that views 3D voxel features as a stack of pseudo images and leverages 2D convolutions for efficient 3D geometric modeling. We further design a separable voxel feature module and a sparse-scene-aware RoI head to enable precise region proposal generation and context-aware feature enhancement. By integrating pseudo-image encoding, multi-scale feature aggregation, and sparse RoI processing, 3DPillars retains PointPillars' real-time inference speed while significantly improving detection accuracy. Extensive experiments on the KITTI and Waymo Open datasets demonstrate that 3DPillars achieves performance competitive with state-of-the-art 3D detectors, striking a favorable balance between speed and accuracy.
📝 Abstract
PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals, which typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gap between PointPillars and state-of-the-art methods while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise between speed and accuracy.
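The core idea — that a 3D voxel feature volume can be treated as a stack of pseudo images processed with ordinary 2D convolutions — can be sketched as follows. The tensor shapes and the toy convolution below are illustrative assumptions, not the paper's actual architecture: the point is only that reshaping a `(C, D, H, W)` volume into `(C·D, H, W)` lets a 2D kernel mix information across depth slices without any 3D convolution.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' 2D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                # Each output pixel sums over every input channel, i.e. over
                # all depth slices at once.
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

# Hypothetical voxel feature volume: C channels, D depth slices, H x W BEV grid.
C, D, H, W = 4, 8, 16, 16
voxels = np.random.randn(C, D, H, W)

# View the volume as a stack of D pseudo images with C*D total channels.
pseudo_stack = voxels.reshape(C * D, H, W)

# A single 2D convolution over the stacked channels aggregates features
# across depth slices using only 2D operations.
kernel = np.random.randn(6, C * D, 3, 3)
features = conv2d(pseudo_stack, kernel)
print(features.shape)  # (6, 14, 14)
```

In practice the paper's separable voxel feature module would realize this far more efficiently than the naive loop above; the sketch only demonstrates why stacking depth slices into the channel dimension makes 2D convolutions sufficient for voxel-based feature learning.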