Video Depth without Video Models

πŸ“… 2024-11-28
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Monocular video depth estimation suffers from inter-frame flickering and breaks when camera motion causes sudden changes in depth range, while existing video foundation models incur high training and inference costs, exhibit imperfect 3D consistency, and are constrained by fixed-length (short) outputs. To address these challenges without resorting to a dedicated video model, we propose an efficient long-video depth estimation framework: we extend a single-image latent diffusion model (LDM) into a frame-triplet depth estimator, sample depth snippets at multiple frame rates, and reassemble them into a consistent video with a robust, optimization-based spatiotemporal registration algorithm. Fine-tuned on synthetic data, our method significantly outperforms state-of-the-art video-based and single-frame depth models on videos spanning hundreds of frames, improving depth accuracy, 3D geometric consistency, and end-to-end inference efficiency at the same time.
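The snippet construction described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: for each temporal dilation `d`, a triplet covers frames `(t, t+d, t+2d)` and is slid over the video so that every frame appears in several overlapping snippets; the dilation values are placeholders, not the paper's exact settings.

```python
def sample_triplets(num_frames, dilations=(1, 8, 32)):
    """Enumerate overlapping frame triplets at several temporal dilations.

    Illustrative sketch of multi-frame-rate snippet sampling: a triplet
    at dilation d spans frames (t, t+d, t+2d). Larger dilations capture
    long-range temporal context; dilation 1 captures local continuity.
    """
    triplets = []
    for d in dilations:
        span = 2 * d
        if span >= num_frames:
            continue  # dilation too large for this clip, skip it
        for t in range(num_frames - span):
            triplets.append((t, t + d, t + 2 * d))
    return triplets

# For a 10-frame clip with dilations 1 and 4:
triplets = sample_triplets(10, dilations=(1, 4))
```

Each triplet is then mapped to a depth snippet by the fine-tuned LDM, and the overlaps between snippets are what the registration step exploits.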

πŸ“ Abstract
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
Problem

Research questions and friction points this paper is trying to address.

Estimating dense per-frame depth in video without a dedicated video foundation model
Preserving temporal continuity, avoiding flicker and breaks from sudden depth-range changes
Maintaining depth accuracy and 3D consistency across long videos (hundreds of frames)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends a single-image latent diffusion model (LDM) into a multi-frame (frame-triplet) depth estimator
Samples depth snippets at multiple frame rates to capture both local and long-range temporal context
Reassembles snippets into a consistent depth video with a robust, optimization-based registration algorithm
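The registration step above can be illustrated with a simplified stand-in. Because each snippet's depth is only defined up to an affine transform, overlapping snippets must be mapped into a common frame; the sketch below solves for a per-snippet scale and shift by closed-form least squares over shared pixels. The function name and the least-squares formulation are illustrative assumptions, not the paper's actual optimization.

```python
import numpy as np

def align_scale_shift(src, ref):
    """Least-squares scale/shift aligning src depth to ref over shared pixels.

    Simplified registration sketch: solve  min_{s,b} || s*src + b - ref ||^2
    in closed form, so the snippet `src` lands in the same affine frame as
    the reference depth `ref`.
    """
    # Design matrix [src, 1] for the affine fit s*src + b.
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s, b

# Toy example: ref is an affinely transformed copy of src,
# so the fit should recover scale 2.0 and shift 0.5.
src = np.array([0.0, 1.0, 2.0, 3.0])
ref = 2.0 * src + 0.5
s, b = align_scale_shift(src, ref)
```

In the full pipeline, such pairwise alignments over many overlapping snippets would be combined into a joint optimization, yielding a globally consistent depth video.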