🤖 AI Summary
Existing self-supervised multi-camera depth estimation methods struggle with geometric inconsistencies across views caused by the structural complexity and motion coupling of articulated vehicles. This work proposes ArticuSurDepth, a novel framework that, for the first time, integrates multi-view spatial context enhancement, cross-view surface normal constraints, ground-aware camera height regularization, and cross-body pose consistency mechanisms. Leveraging structural priors from vision foundation models, the approach enables self-supervised learning of surround-view depth for articulated vehicles. The method substantially improves structural coherence and metric accuracy of depth estimates, achieving state-of-the-art performance on both a newly curated articulated vehicle dataset and established public benchmarks including DDAD, nuScenes, and KITTI.
📝 Abstract
Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotics platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose \textbf{ArticuSurDepth}, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from vision foundation models. Specifically, we introduce a multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate ground-plane-aware camera height regularization to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate the proposed method, we built an articulated-vehicle experimental platform and collected a dataset on it. Experimental results demonstrate state-of-the-art (SoTA) depth estimation performance on our self-collected dataset as well as on the DDAD, nuScenes, and KITTI benchmarks.