🤖 AI Summary
Self-supervised monocular depth estimation suffers significant performance degradation in low-texture and dynamic regions. To address this, we propose an uncertainty-aware teacher-student framework that integrates visual odometry and optical-flow-guided motion modeling into self-supervised training. The teacher network leverages optical flow to strengthen geometric constraints in weak-texture regions, while the student network employs uncertainty-aware masking to suppress interference from dynamic or unreliable pixels during joint depth and pose optimization. Our method requires no ground-truth depth labels, additional annotations, or inference-time overhead, enabling end-to-end robust depth and pose estimation. Evaluated on KITTI and Cityscapes, it achieves state-of-the-art performance—particularly improving depth accuracy at dynamic object boundaries and textureless regions—and concurrently enhances pose estimation accuracy.
📝 Abstract
Monocular depth estimation has been increasingly adopted in robotics and autonomous driving for its ability to infer scene geometry from a single camera. In self-supervised monocular depth estimation frameworks, the network jointly generates and exploits depth and pose estimates during training, thereby eliminating the need for depth labels. However, these methods remain challenged by uncertainty in the input data, such as low-texture or dynamic regions, which can reduce depth accuracy. To address this, we introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy at dynamic object boundaries and in textureless regions. Specifically, we develop a teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and network architecture, thereby strengthening supervision where photometric signals are weak. Unlike prior motion-aware approaches that incur inference-time overhead and rely on additional labels or auxiliary networks for real-time generation, our method uses optical flow exclusively within the teacher network during training, eliminating extra labeling demands and any runtime cost. Extensive experiments on the KITTI and Cityscapes datasets demonstrate the effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI datasets.