UM-Depth: Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry

📅 2025-09-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Self-supervised monocular depth estimation degrades significantly in low-texture and dynamic regions. To address this, the authors propose an uncertainty-aware teacher-student framework that integrates visual odometry and optical-flow-guided motion modeling into self-supervised training. The teacher network uses optical flow to strengthen geometric constraints in weak-texture regions, while the student network applies uncertainty-aware masking to suppress interference from dynamic or unreliable pixels during joint depth and pose optimization. The method requires no ground-truth depth labels or additional annotations and adds no inference-time overhead. Evaluated on KITTI and Cityscapes, it improves depth accuracy at dynamic object boundaries and in textureless regions, and it achieves state-of-the-art self-supervised depth and pose results on KITTI.

📝 Abstract

Monocular depth estimation has been increasingly adopted in robotics and autonomous driving for its ability to infer scene geometry from a single camera. In self-supervised monocular depth estimation frameworks, the network jointly generates and exploits depth and pose estimates during training, thereby eliminating the need for depth labels. However, these methods remain challenged by uncertainty in the input data, such as low-texture or dynamic regions, which can reduce depth accuracy. To address this, we introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy at dynamic object boundaries and in textureless regions. Specifically, we develop a teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and network architecture, thereby strengthening supervision where photometric signals are weak. Unlike prior motion-aware approaches that incur inference-time overhead and rely on additional labels or auxiliary networks for real-time generation, our method uses optical flow exclusively within the teacher network during training, eliminating extra labeling demands and any runtime cost. Extensive experiments on the KITTI and Cityscapes datasets demonstrate the effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI datasets.
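
The idea of uncertainty masking can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the mask formulation, so the hard threshold, the function names (`uncertainty_mask`, `masked_photometric_loss`), and the flat per-pixel lists are all assumptions made for illustration. In a real pipeline these values would be image-shaped tensors produced by the depth and pose networks.

```python
from typing import List, Sequence


def uncertainty_mask(uncertainty: Sequence[float], threshold: float) -> List[bool]:
    """Return True for pixels whose predicted uncertainty is below the threshold.

    The hard threshold is an assumption; the paper may use a soft or learned mask.
    """
    return [u < threshold for u in uncertainty]


def masked_photometric_loss(
    photo_error: Sequence[float],
    uncertainty: Sequence[float],
    threshold: float,
) -> float:
    """Average the photometric error over pixels that pass the uncertainty mask.

    Pixels flagged as unreliable (for example, on moving objects) are excluded,
    so they do not contribute to depth and pose optimization.
    """
    if len(photo_error) != len(uncertainty):
        raise ValueError("photo_error and uncertainty must have the same length")

    mask = uncertainty_mask(uncertainty, threshold)
    kept = [err for err, keep in zip(photo_error, mask) if keep]

    # If every pixel is masked out, return zero rather than dividing by zero.
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical per-pixel values: two reliable pixels and two unreliable ones.
photo_error = [0.1, 0.9, 0.2, 0.8]
uncertainty = [0.05, 0.7, 0.1, 0.6]
loss = masked_photometric_loss(photo_error, uncertainty, threshold=0.5)
```

Here only the first and third pixels pass the mask, so the loss is the mean of their errors and the two high-error, high-uncertainty pixels are ignored. A soft weighting of each pixel by its confidence would be an equally plausible design; the hard mask is used only because it is the simplest to read.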

Problem

Research questions and friction points this paper is trying to address.

Addressing uncertainty in self-supervised monocular depth estimation

Enhancing depth accuracy in textureless and dynamic regions

Eliminating inference-time overhead for motion-aware refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty-aware teacher-student training strategy

Motion-aware refinement without inference overhead

Optical flow used only during teacher training

🔎 Similar Papers

Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes