Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

πŸ“… 2025-01-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address temporal inconsistency, accuracy degradation, and insufficient real-time performance in monocular depth estimation for long videos (several minutes), this paper proposes the first high-quality depth estimation framework tailored for super-long videos. Methodologically: (i) it introduces a temporal depth gradient consistency loss that requires no geometric priors such as optical flow or camera poses; (ii) it designs a key-frame-based inference strategy coupled with a lightweight spatial-temporal head for efficient modeling; and (iii) building upon Depth Anything V2, the model is trained jointly on video depth data and unlabeled images, enabling zero-shot evaluation. The method achieves zero-shot state-of-the-art performance across multiple video depth benchmarks, supports end-to-end inference on arbitrarily long videos, and attains real-time inference at 30 FPS with its smallest model variant, establishing a strong trade-off among accuracy, temporal coherence, and cross-scene generalization.
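The temporal depth gradient consistency idea described above can be sketched in a few lines: penalize the difference between the predicted and ground-truth depth *change* at each pixel across adjacent frames, which needs no optical flow or camera poses. This is a minimal illustration under assumed shapes; the function name and exact formulation are hypothetical, not the paper's implementation.

```python
import numpy as np

def temporal_gradient_loss(pred, gt):
    """Sketch of a temporal depth-gradient consistency loss.

    pred, gt: arrays of shape (T, H, W) holding predicted and
    ground-truth depth for T consecutive frames. The loss compares
    the frame-to-frame depth change of the prediction against that
    of the ground truth, so it constrains temporal behavior without
    any geometric priors such as optical flow.
    """
    d_pred = pred[1:] - pred[:-1]  # temporal gradient of predictions
    d_gt = gt[1:] - gt[:-1]        # temporal gradient of ground truth
    return float(np.abs(d_pred - d_gt).mean())
```

Because only depth differences between consecutive frames are compared, a prediction that is offset by a constant across all frames still incurs zero temporal loss; spatial accuracy is handled by separate per-frame depth losses.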

πŸ“ Abstract
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (<10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
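The key-frame-based long-video inference described in the abstract can be approximated by splitting the frame sequence into overlapping windows, where the leading frames of each window reuse the previous window's tail to keep depth scale aligned. The sketch below is an illustrative windowing scheme under assumed parameters (`window`, `overlap` and the function name are hypothetical), not the paper's exact strategy.

```python
def make_windows(num_frames, window=32, overlap=8):
    """Sketch: split `num_frames` video frames into overlapping
    inference windows. The first `overlap` frames of each window
    repeat the previous window's tail, giving the model shared
    context so consecutive windows produce consistent depth.
    Assumes window > overlap >= 0.
    """
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - overlap  # step back to create the overlap
    return windows
```

For example, `make_windows(10, window=4, overlap=2)` yields four windows whose overlapping frames can be used to align the depth scale of each new window to the previous one, which is what allows arbitrarily long videos to be processed window by window.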
Problem

Research questions and friction points this paper is trying to address.

Depth Estimation
Long Video
Consistency and Real-time Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Depth Prediction
Real-time Processing
Temporal Stability