Learning Temporally Consistent Video Depth from Video Diffusion Priors

📅 2024-06-03

🏛️ arXiv.org

📈 Citations: 41

✨ Influential: 7

career value

198K/year

🤖 AI Summary

This work addresses the insufficient inter-frame temporal consistency in streaming video depth estimation. We propose ChronoDepth, a framework that formulates depth prediction as a conditional generation task guided by a video diffusion prior, enabling coherent inference over arbitrarily long videos. Key contributions include: (i) the first cross-clip contextual modeling mechanism to capture long-range temporal dependencies; (ii) a sliding-window inference strategy coupled with noise-free reinitialization to ensure temporal stability for extended sequences; and (iii) an intra-clip consistent contextual training paradigm. The method integrates video diffusion models, inter-frame context distillation, and noise-level-independent sampling. Evaluated on multiple benchmarks, ChronoDepth achieves state-of-the-art performance, with a significant improvement in temporal depth map stability—evidenced by a 37% reduction in ΔE error.

Technology Category

Application Category

📝 Abstract

This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Thus, instead of directly developing a depth estimator from scratch, we reformulate this predictive task into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.

Problem

Research questions and friction points this paper is trying to address.

Estimating temporally consistent depth in streamed videos

Sharing contextual information between video frames

Conditional generation for cross-clip depth consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulate depth prediction as conditional generation

Consistent context-aware training for long videos

Sliding window strategy with noise sampling

🔎 Similar Papers

Self-supervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion