🤖 AI Summary
To address two key bottlenecks in monocular 3D object detection (inaccurate depth estimation and insufficient 3D structural awareness in depth representations), this paper proposes RD3D. Methodologically, it introduces (1) the Depth Thickness Field (DTF), a novel continuous representation that models both the scene depth distribution and a thickness dimension; (2) a decoupled scene-level and instance-level optimization: Scene-Level Depth Retargeting (SDR) ensures geometric consistency of the DTF, while Instance-Level Spatial Refinement (ISR) resolves voxel occupancy ambiguity; and (3) a depth-adaptive architecture with multi-scale depth distribution fusion. Evaluated on KITTI and Waymo, RD3D achieves significant improvements over state-of-the-art methods, particularly in 3D detection AP and in robustness under long-range and occluded scenarios. It also generalizes well across diverse depth estimators.
📝 Abstract
Monocular 3D object detection is challenging due to the lack of accurate depth. However, existing depth-assisted solutions still exhibit inferior performance, a gap universally attributed to the unsatisfactory accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and identify an additional issue: the limited 3D structure-aware capability of existing depth representations (*e.g.*, depth one-hot encoding or depth distribution). To address this issue, we propose a novel depth-adapted monocular 3D object detection network, termed **RD3D**, that mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former incorporates scene-level perception of 3D structures, retargeting traditional depth representations to a new formulation: the **Depth Thickness Field**. The latter refines the voxel spatial representation with the guidance of instances, eliminating the ambiguity of 3D occupation and thus improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets demonstrate our superiority over existing state-of-the-art (SoTA) methods and our universality when equipped with different depth estimation models. The code will be available.
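The paper does not give the exact formulation of the Depth Thickness Field, but the idea of turning a per-pixel categorical depth distribution into a continuous occupancy interval along the depth axis can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `depth_thickness_field`, the sigmoid-edged interval, and the use of distribution spread as a proxy for thickness are all assumptions made for this sketch.

```python
import numpy as np

def depth_thickness_field(depth_probs, bin_centers, sharpness=4.0):
    """Toy continuous 'depth thickness field' (hypothetical formulation).

    For each pixel, convert a categorical depth distribution into a soft
    occupancy interval [mu, mu + thickness] along the depth axis, rather
    than a single depth value or one-hot bin.

    depth_probs: (H, W, D) softmax over D depth bins.
    bin_centers: (D,) metric depth of each bin.
    Returns: (H, W, D) soft occupancy in [0, 1].
    """
    # Expected depth per pixel (front surface estimate).
    mu = (depth_probs * bin_centers).sum(-1, keepdims=True)          # (H, W, 1)
    # Distribution spread as a crude proxy for object extent in depth.
    var = (depth_probs * (bin_centers - mu) ** 2).sum(-1, keepdims=True)
    thickness = 2.0 * np.sqrt(var)                                   # (H, W, 1)
    d = bin_centers[None, None, :]                                   # (1, 1, D)
    # Soft occupancy: rises at the front surface, falls past the back.
    front = 1.0 / (1.0 + np.exp(-sharpness * (d - mu)))
    back = 1.0 / (1.0 + np.exp(-sharpness * (mu + thickness - d)))
    return front * back
```

Compared with a one-hot depth encoding, every voxel between the estimated front and back surfaces receives non-trivial occupancy, which is the kind of 3D structure awareness the abstract argues existing depth representations lack.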