🤖 AI Summary
Self-supervised monocular depth estimation often produces ambiguous predictions in low-texture regions, where the photometric loss provides little training signal. To address this, this work proposes a self-supervised method that enhances spatial structure by applying a distance transform over pre-semantic contours, then uses the transformed representation both to build new input images and to design the loss functions. The approach jointly estimates pre-semantic contours, depth, and ego-motion. The authors argue theoretically that the distance transform is the optimal variance-augmenting technique in this setting. In experiments on KITTI, Cityscapes, Waymo, NYUv2, and ScanNet, the method surpasses competing self-supervised monocular depth estimation approaches.
📝 Abstract
Self-supervised monocular depth estimation (MDE) struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low-texture regions. Our approach jointly estimates pre-semantic contours, depth, and ego-motion. The pre-semantic contours are used to produce new input images whose variance in uniform areas is augmented by the distance transform. This results in more effective loss functions, improving the training of depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI, Cityscapes, Waymo, NYUv2, and ScanNet, our model demonstrates robust performance, surpassing competing self-supervised methods in MDE.
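The variance-augmentation idea can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a binary contour map is already available, and it swaps in a city-block distance computed by multi-source breadth-first search, which may differ from the metric used in the paper. The helper names `distance_transform`, `variance`, and `shift_error` are hypothetical. On a textureless surface the raw intensity has zero variance, so a photometric-style error cannot distinguish a one-pixel shift, whereas the distance-transformed image varies across the same region.

```python
from collections import deque


def distance_transform(contour):
    """City-block distance from each pixel to the nearest contour pixel.

    `contour` is a non-empty 2D list of 0/1 values (1 = contour pixel).
    Assumption: a multi-source BFS over 4-connected neighbours stands in
    for whatever distance transform the paper actually uses. Pixels stay
    None if the map contains no contour pixel at all.
    """
    h, w = len(contour), len(contour[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if contour[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def shift_error(row, shift):
    """Mean absolute difference between a row and itself shifted by `shift` pixels."""
    pairs = list(zip(row[shift:], row[:-shift]))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Toy 5x9 scene (hypothetical data): a surface of constant intensity,
# bounded by vertical contours in the first and last columns.
H, W = 5, 9
intensity = [[0.5] * W for _ in range(H)]
contour = [[1 if x in (0, W - 1) else 0 for x in range(W)] for _ in range(H)]

dt = distance_transform(contour)
interior_intensity = intensity[2][1:-1]  # uniform region, raw image
interior_dt = dt[2][1:-1]                # same region after the transform

print("raw variance:", variance(interior_intensity))
print("transformed variance:", variance(interior_dt))
print("raw 1-px shift error:", shift_error(intensity[2], 1))
print("transformed 1-px shift error:", shift_error(dt[2], 1))
```

In this toy scene the raw image gives no signal for a one-pixel misalignment inside the uniform region, while the transformed image does, which is the intuition behind using the distance transform to make the loss more discriminative in low-texture areas.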