Revisiting Monocular 3D Object Detection from Scene-Level Depth Retargeting to Instance-Level Spatial Refinement

📅 2024-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key bottlenecks in monocular 3D object detection, namely inaccurate depth estimation and insufficient 3D structural awareness in depth representations, this paper proposes RD3D. Methodologically, it introduces (1) the Depth Thickness Field (DTF), a novel continuous representation modeling both the scene depth distribution and a thickness dimension; (2) a decoupled scene-level and instance-level optimization: Scene-Level Depth Retargeting (SDR) ensures geometric consistency of the DTF, while Instance-Level Spatial Refinement (ISR) resolves voxel occupancy ambiguity; and (3) a depth-adapted architecture with multi-scale depth distribution fusion. Evaluated on KITTI and Waymo, RD3D achieves significant improvements over state-of-the-art methods, particularly enhancing 3D detection AP and robustness in long-range and occluded scenarios. Moreover, it demonstrates strong generalization across diverse depth estimators.

📝 Abstract
Monocular 3D object detection is challenging due to the lack of accurate depth. However, existing depth-assisted solutions still exhibit inferior performance, a shortfall commonly attributed to the unsatisfactory accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and formulate an additional issue as the limited 3D structure-aware capability of existing depth representations (e.g., depth one-hot encoding or depth distribution). To address this issue, we propose a novel depth-adapted monocular 3D object detection network, termed RD3D, that mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former incorporates the scene-level perception of 3D structures, retargeting traditional depth representations to a new formulation: the Depth Thickness Field. The latter refines the voxel spatial representation with the guidance of instances, eliminating the ambiguity of 3D occupation and thus improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets demonstrate our superiority over existing state-of-the-art (SoTA) methods and our universality when equipped with different depth estimation models. The code will be available.
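The abstract contrasts the proposed Depth Thickness Field with the depth representations it aims to replace: depth one-hot encoding and soft depth distributions. The paper defines the field precisely; the toy sketch below only illustrates the general distinction, and all function names and the fixed-interval "thickness" model here are hypothetical, not the authors' formulation.

```python
import math

def one_hot_depth(depth, bins):
    """Depth one-hot encoding: all probability mass on the nearest bin."""
    idx = min(range(len(bins)), key=lambda i: abs(bins[i] - depth))
    return [1.0 if i == idx else 0.0 for i in range(len(bins))]

def depth_distribution(depth, bins, sigma=1.0):
    """Soft depth distribution: Gaussian mass around the surface depth."""
    w = [math.exp(-((b - depth) ** 2) / (2 * sigma ** 2)) for b in bins]
    s = sum(w)
    return [x / s for x in w]

def depth_thickness_field(depth, thickness, bins):
    """Hypothetical thickness-style encoding: rather than marking only the
    visible surface, spread occupancy over [depth, depth + thickness],
    mimicking the idea that objects occupy an extent along the ray."""
    w = [1.0 if depth <= b <= depth + thickness else 0.0 for b in bins]
    s = sum(w) or 1.0
    return [x / s for x in w]

bins = [float(b) for b in range(10)]  # discretized depth bins in meters
surface = one_hot_depth(4.2, bins)          # single-bin surface encoding
soft = depth_distribution(4.2, bins)        # smeared surface encoding
volume = depth_thickness_field(4.0, 2.0, bins)  # occupancy over an extent
```

The point of the contrast: the first two representations describe only where the visible surface lies, while a thickness-style field also says how much space behind it is occupied, which is the structural cue a voxel-based detector needs.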
Problem

Research questions and friction points this paper is trying to address.

3D object detection
monocular camera
depth estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

RD3D Network
Depth Thickness Field
Monocular 3D Object Detection
Qiude Zhang
Institute of Information Science, Beijing Jiaotong University, Beijing Key Laboratory of Advanced Information Science and Network Technology
Chunyu Lin
Institute of Information Science, Beijing Jiaotong University, Beijing Key Laboratory of Advanced Information Science and Network Technology
Zhijie Shen
Beijing Jiaotong University
Nie Lang
Institute of Information Science, Beijing Jiaotong University, Beijing Key Laboratory of Advanced Information Science and Network Technology
Yao Zhao
Institute of Information Science, Beijing Jiaotong University, Beijing Key Laboratory of Advanced Information Science and Network Technology