🤖 AI Summary
To address the heavy reliance of 3D object detection in autonomous driving on expensive LiDAR sensors, this paper proposes a camera-only 2D-to-3D lifting method. The approach introduces an end-to-end, geometrically constrained lifting pipeline that projects 2D detection bounding boxes into 3D space. Crucially, it is the first to use a lightweight 2D CNN to process the point-cloud features associated with each 2D detection, achieving high accuracy while keeping the computational cost low. Furthermore, the method jointly models all major road users, including vehicles, pedestrians, and cyclists. Evaluated on the KITTI benchmark, it matched state-of-the-art (SOTA) image-only methods at the time while running in only a third of their runtime. This work offers an efficient, cost-effective path toward vision-only autonomous driving.
📝 Abstract
Image-based 3D object detection is an inevitable part of autonomous driving because cheap onboard cameras are already available in most modern cars. Because LiDAR provides accurate depth information, most current state-of-the-art 3D object detectors rely heavily on LiDAR data. In this paper, we propose a pipeline that lifts the results of existing vision-based 2D algorithms to 3D detections using only cameras, as a cost-effective alternative to LiDAR. In contrast to existing approaches, we focus not only on cars but on all types of road users. To the best of our knowledge, we are the first to use a 2D CNN to process the point cloud of each 2D detection, keeping the computational effort as low as possible. Our evaluation on the challenging KITTI 3D object detection benchmark shows results comparable to state-of-the-art image-based approaches at only a third of their runtime.
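To make the core idea of lifting a 2D detection into 3D concrete, the sketch below back-projects the center of a 2D bounding box into camera coordinates with the standard pinhole model. This is only an illustrative toy, not the paper's actual pipeline: the function name `lift_box_center`, the depth value, and the (approximately KITTI-like) intrinsics are all assumptions introduced here for the example.

```python
import numpy as np

def lift_box_center(box, depth, K):
    """Back-project the center of a 2D box (u1, v1, u2, v2), given in
    pixels, to a 3D point (x, y, z) in camera coordinates, assuming
    pinhole intrinsics K and a known or estimated depth z.
    NOTE: illustrative sketch only, not the paper's method."""
    u = 0.5 * (box[0] + box[2])          # box center, u coordinate
    v = 0.5 * (box[1] + box[3])          # box center, v coordinate
    fx, fy = K[0, 0], K[1, 1]            # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]            # principal point
    x = (u - cx) * depth / fx            # invert u = fx * x / z + cx
    y = (v - cy) * depth / fy            # invert v = fy * y / z + cy
    return np.array([x, y, depth])

# Example with made-up, roughly KITTI-like intrinsics and depth.
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
center3d = lift_box_center((600.0, 160.0, 700.0, 220.0), depth=15.0, K=K)
```

A full pipeline would additionally estimate the box's 3D dimensions and orientation under geometric constraints; this snippet only shows the basic 2D-to-3D geometry that any camera-only lifting approach builds on.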