StableDPT: Temporal Stable Monocular Video Depth Estimation

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the temporal inconsistency and flickering artifacts commonly observed when applying monocular depth estimation models directly to video sequences. To this end, the authors propose a lightweight, trainable temporal module that enhances temporal coherence by integrating global temporal context through cross-frame keyframe sampling and an efficient cross-attention mechanism. The method supports videos of arbitrary length while avoiding the scale misalignment and redundant computation caused by overlapping windows. Built upon a Vision Transformer encoder and a Dense Prediction Transformer head, the architecture enables efficient inference. Experimental results demonstrate that the proposed approach achieves state-of-the-art performance across multiple benchmarks, significantly outperforms existing methods in temporal consistency, and runs roughly 2× faster in practical inference.
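The summary mentions cross-frame keyframe sampling that gives every frame the same global context, in contrast to overlapping sliding windows. The paper's exact sampling scheme is not reproduced here; as a rough illustration only, a uniform sampler (the function name and the choice of mid-bin indices are hypothetical) could look like:

```python
def sample_keyframes(num_frames: int, num_keyframes: int = 4) -> list[int]:
    """Pick keyframe indices spread uniformly across the whole clip.

    Because every frame attends to the same fixed set of keyframes,
    there are no overlapping windows and hence no per-window scale
    misalignment to reconcile afterwards.
    """
    if num_frames <= num_keyframes:
        return list(range(num_frames))
    step = num_frames / num_keyframes
    # Take the midpoint of each of the num_keyframes equal bins.
    return [int(step * i + step / 2) for i in range(num_keyframes)]

print(sample_keyframes(100))  # → [12, 37, 62, 87]
```

A sliding-window method would instead recompute depth for each overlapping chunk and then align scales across chunks; sampling a single global keyframe set sidesteps that reconciliation step entirely.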

πŸ“ Abstract
Applying single-image Monocular Depth Estimation (MDE) models to video sequences introduces significant temporal instability and flickering artifacts. We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing by integrating a new temporal module, trainable on a single GPU in a few days. Our architecture, StableDPT, builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. The core of our contribution lies in the temporal layers within the head, which use an efficient cross-attention mechanism to integrate information from keyframes sampled across the entire video sequence. This allows the model to capture global context and inter-frame relationships, leading to more accurate and temporally stable depth predictions. Furthermore, we propose a novel inference strategy for processing videos of arbitrary length that avoids the scale misalignment and redundant computation associated with the overlapping windows used in other methods. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and, on top of that, 2× faster processing in real-world scenarios.
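The abstract's key mechanism is cross-attention in the DPT head: tokens of the current frame act as queries against tokens gathered from the sampled keyframes. The paper's actual layers (projections, heads, normalization) are not specified here; the following is a minimal single-head NumPy sketch with hypothetical shapes, intended only to show the query/key asymmetry between the current frame and the keyframe pool:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frame_tokens: np.ndarray,
                          keyframe_tokens: np.ndarray) -> np.ndarray:
    """Each current-frame token queries the pooled keyframe tokens.

    frame_tokens:    (N, d) tokens of the frame being predicted
    keyframe_tokens: (M, d) tokens concatenated from all sampled keyframes
    Returns:         (N, d) context-enriched tokens
    """
    d = frame_tokens.shape[-1]
    scores = frame_tokens @ keyframe_tokens.T / np.sqrt(d)  # (N, M)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    return weights @ keyframe_tokens                        # (N, d)

rng = np.random.default_rng(0)
frame = rng.standard_normal((16, 64))        # 16 tokens, current frame
keys = rng.standard_normal((4 * 16, 64))     # tokens from 4 keyframes
out = cross_frame_attention(frame, keys)
print(out.shape)  # → (16, 64)
```

Because the keyframe pool stays fixed for the whole video, this attention injects the same global context into every frame, which is what makes consecutive predictions agree in scale and suppresses flicker.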
Problem

Research questions and friction points this paper is trying to address.

temporal instability
monocular depth estimation
video depth estimation
flickering artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal stability
monocular depth estimation
cross-attention
video depth estimation
efficient inference