🤖 AI Summary
This paper addresses the degraded robustness of self-supervised monocular depth estimation under adverse weather conditions and imaging noise by proposing a diffusion-based robust self-supervised framework. Its key contributions are: (1) a hierarchical feature-guided denoising module that leverages multi-scale visual features to preserve depth perception from blurred or noisy images; and (2) an implicit depth consistency loss, derived from the reprojection process, that constrains the depth network independently of the other subnetwork and enforces depth-scale consistency within a video sequence. The method requires no ground-truth depth, training only on monocular video sequences. Evaluated on KITTI and Make3D, the approach outperforms existing generative-based methods while improving both depth accuracy and robustness to blur and noise.
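As a concrete illustration of the feature-guidance idea, here is a minimal PyTorch sketch, assuming a denoiser that predicts the noise added to a depth map while multi-scale image features are injected at a shared resolution. `FeatureGuidedDenoiser`, its layer layout, and the additive injection are hypothetical stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGuidedDenoiser(nn.Module):
    """Illustrative sketch of hierarchical feature guidance: multi-scale
    image features steer a diffusion denoiser that predicts the noise
    added to a depth map. Shapes and module names are assumptions."""

    def __init__(self, feat_channels=(64, 128, 256), base=64):
        super().__init__()
        # Project each image-feature scale to a common guidance width.
        self.guides = nn.ModuleList(
            nn.Conv2d(c, base, kernel_size=1) for c in feat_channels
        )
        self.head = nn.Conv2d(1, base, 3, padding=1)   # embed noisy depth
        self.body = nn.Sequential(
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.out = nn.Conv2d(base, 1, 3, padding=1)    # predict the noise

    def forward(self, noisy_depth, image_feats, t_emb):
        # noisy_depth: (B,1,H,W); image_feats: coarse-to-fine feature maps;
        # t_emb: (B, base) diffusion-timestep embedding.
        x = self.head(noisy_depth) + t_emb[:, :, None, None]
        for proj, feat in zip(self.guides, image_feats):
            # Resize each scale's guidance to the working resolution and
            # inject it additively (one simple form of feature guidance).
            g = F.interpolate(proj(feat), size=x.shape[-2:],
                              mode="bilinear", align_corners=False)
            x = x + g
        return self.out(self.body(x))


# Toy usage with random tensors standing in for encoder features.
B, H, W = 2, 96, 320
feats = [torch.randn(B, c, H // s, W // s)
         for c, s in zip((64, 128, 256), (4, 8, 16))]
eps_hat = FeatureGuidedDenoiser()(
    torch.randn(B, 1, H, W), feats, torch.randn(B, 64))
print(eps_hat.shape)  # torch.Size([2, 1, 96, 320])
```

Additive injection is only one plausible conditioning mechanism; concatenation- or attention-based guidance would fit the same interface.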
📝 Abstract
Self-supervised monocular depth estimation has received widespread attention because it can be trained without ground truth. In real-world scenarios, images may be blurry or noisy due to weather conditions and inherent limitations of the camera, so developing a robust depth estimation model is particularly important. Benefiting from their training strategies, generative methods often exhibit enhanced robustness. In light of this, we employ the diffusion model, a generative model with a distinctive denoising training process, for self-supervised monocular depth estimation. To further enhance the robustness of the diffusion model, we investigate how perturbations affect image features and propose a hierarchical feature-guided denoising module. Furthermore, we explore the depth implicit in reprojection and design an implicit depth consistency loss. Because this loss is unaffected by the other subnetwork, it can be targeted to constrain the depth estimation network and ensure scale consistency of depth within a video sequence. We conduct experiments on the KITTI and Make3D datasets; the results show that our approach stands out among generative-based models while exhibiting remarkable robustness.
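To make the "depth implicit in reprojection" idea concrete: when target-frame pixels are back-projected with the predicted depth and transformed by the predicted relative pose, the z-coordinate of the transformed points is itself a depth estimate, which can be compared against the depth map predicted for the other frame. The sketch below shows one plausible form of such a consistency term, assuming a standard structure-from-motion setup where a pose subnetwork predicts the relative camera motion; the function name and the exact penalty are assumptions, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def implicit_depth_consistency(depth_t, depth_s, T_t2s, K, K_inv):
    """Sketch of a reprojection-derived depth consistency term.
    depth_t/depth_s: predicted depth maps of two frames, (B,1,H,W);
    T_t2s: relative pose target->source, (B,4,4); K, K_inv: (B,3,3)
    intrinsics. Hypothetical form; the paper's loss may differ."""
    B, _, H, W = depth_t.shape
    dev = depth_t.device

    # Homogeneous pixel grid, (B,3,H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                            torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float()
    pix = pix.view(3, -1).expand(B, -1, -1)

    # Back-project target pixels and move them into the source frame.
    cam = (K_inv @ pix) * depth_t.view(B, 1, -1)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
    cam_s = (T_t2s @ cam)[:, :3]

    # The depth "implicit" in reprojection: z of the transformed points.
    z_implicit = cam_s[:, 2:3].view(B, 1, H, W).clamp(min=1e-6)

    # Sample the source depth prediction at the reprojected pixels.
    uv = K @ cam_s
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    u = 2 * uv[:, 0].view(B, H, W) / (W - 1) - 1
    v = 2 * uv[:, 1].view(B, H, W) / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1)                    # (B,H,W,2)
    z_sampled = F.grid_sample(depth_s, grid, align_corners=True)

    # Penalize scale drift between the two depth estimates; a real
    # implementation would also mask out-of-view pixels.
    return (torch.abs(z_implicit - z_sampled) /
            (z_implicit + z_sampled)).mean()
```

Because both terms come from the same pair of depth predictions, penalizing their normalized difference discourages per-frame scale drift while leaving the photometric reprojection loss untouched, which matches the stated goal of constraining the depth network without interference from the other subnetwork.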