Video Depth without Video Models

πŸ“… 2024-11-28
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Monocular video depth estimation suffers from inter-frame flickering and breaks when camera motion causes sudden changes in depth range, while existing video foundation models incur high training and inference costs, exhibit imperfect 3D consistency, and are constrained by fixed-length (short) outputs. To address these challenges without resorting to a dedicated video model, we propose an efficient long-video depth estimation framework: we extend a single-image latent diffusion model (LDM) into a frame-triplet depth estimator, sample depth snippets at multiple frame rates, and reassemble them into a consistent video with a robust, optimization-based spatiotemporal registration algorithm. Fine-tuned on synthetic data, our method significantly outperforms state-of-the-art video-based and single-frame depth models on videos spanning hundreds of frames, improving depth accuracy, 3D geometric consistency, and end-to-end inference efficiency at the same time.
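The snippet construction described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: for each temporal dilation `d`, a triplet covers frames `(t, t+d, t+2d)` and is slid over the video so that every frame appears in several overlapping snippets; the dilation values are placeholders, not the paper's exact settings.

```python
def sample_triplets(num_frames, dilations=(1, 8, 32)):
    """Enumerate overlapping frame triplets at several temporal dilations.

    Illustrative sketch of multi-frame-rate snippet sampling: a triplet
    at dilation d spans frames (t, t+d, t+2d). Larger dilations capture
    long-range temporal context; dilation 1 captures local continuity.
    """
    triplets = []
    for d in dilations:
        span = 2 * d
        if span >= num_frames:
            continue  # dilation too large for this clip, skip it
        for t in range(num_frames - span):
            triplets.append((t, t + d, t + 2 * d))
    return triplets

# For a 10-frame clip with dilations 1 and 4:
triplets = sample_triplets(10, dilations=(1, 4))
```

Each triplet is then mapped to a depth snippet by the fine-tuned LDM, and the overlaps between snippets are what the registration step exploits.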

πŸ“ Abstract
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
Problem

Research questions and friction points this paper is trying to address.

Estimating dense per-frame depth in video without a dedicated video foundation model
Preserving temporal continuity, avoiding flicker and breaks from sudden depth-range changes
Maintaining depth accuracy and 3D consistency across long videos (hundreds of frames)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends a single-image latent diffusion model (LDM) into a multi-frame (frame-triplet) depth estimator
Samples depth snippets at multiple frame rates to capture both local and long-range temporal context
Reassembles snippets into a consistent depth video with a robust, optimization-based registration algorithm
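The registration step above can be illustrated with a simplified stand-in. Because each snippet's depth is only defined up to an affine transform, overlapping snippets must be mapped into a common frame; the sketch below solves for a per-snippet scale and shift by closed-form least squares over shared pixels. The function name and the least-squares formulation are illustrative assumptions, not the paper's actual optimization.

```python
import numpy as np

def align_scale_shift(src, ref):
    """Least-squares scale/shift aligning src depth to ref over shared pixels.

    Simplified registration sketch: solve  min_{s,b} || s*src + b - ref ||^2
    in closed form, so the snippet `src` lands in the same affine frame as
    the reference depth `ref`.
    """
    # Design matrix [src, 1] for the affine fit s*src + b.
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s, b

# Toy example: ref is an affinely transformed copy of src,
# so the fit should recover scale 2.0 and shift 0.5.
src = np.array([0.0, 1.0, 2.0, 3.0])
ref = 2.0 * src + 0.5
s, b = align_scale_shift(src, ref)
```

In the full pipeline, such pairwise alignments over many overlapping snippets would be combined into a joint optimization, yielding a globally consistent depth video.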