🤖 AI Summary
Depth estimation from videos has long suffered from the scarcity of consistent, scalable real-world depth annotations, leading to unstable model performance. To address this, we propose a scalable synthetic data generation pipeline that produces 40,000 high-fidelity, 5-second video clips with dense, pixel-accurate depth supervision. We further design a mixed-duration training strategy and build on a generative video diffusion model with rotary position encoding, enabling inference on sequences of arbitrary length and at variable frame rates, down to single frames. Flow matching improves training flexibility and efficiency, and a novel depth interpolation scheme at inference extends high-resolution depth prediction to sequences of up to 150 frames. Our method outperforms all existing generative depth models in both spatial accuracy and temporal consistency. The codebase and pretrained weights are publicly released.
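The rotary position encoding mentioned in the summary can be illustrated with a minimal sketch. This is a generic, hypothetical single-tensor version (the function name and the paper's exact multi-dimensional formulation are assumptions): each pair of feature channels is rotated by a position-dependent angle, so attention scores end up depending on relative offsets between frames.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply a basic rotary position encoding along the sequence axis.

    x: tensor of shape (..., seq_len, dim) with even dim.
    Each channel pair (x1_i, x2_i) is rotated by angle pos * freq_i,
    which preserves vector norms and encodes position multiplicatively.
    """
    *_, seq_len, dim = x.shape
    half = dim // 2
    # Geometric frequency schedule, as in the original RoPE formulation.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    # angles[p, i] = p * freqs[i], one rotation angle per position/pair.
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1_i, x2_i) pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because the transform is a pure rotation, position 0 is left unchanged and per-token norms are preserved, which is one reason rotary encodings generalize across sequence lengths.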
📝 Abstract
Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that addresses this challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse virtual environments and yielding 40,000 five-second video clips, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency. The code and model weights are open-sourced.
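The flow matching objective named in the abstract can be sketched in a few lines. This is a minimal, hypothetical rectified-flow-style version, assuming a velocity-prediction network; the function name, arguments, and conditioning are illustrative and not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """Sketch of a rectified-flow flow matching loss (assumed form).

    x1:   clean target latents (e.g. depth latents), shape (B, ...).
    cond: conditioning input (e.g. video latents); passed through to the model.
    The network is trained to predict the constant velocity (x1 - x0) along
    a straight line between a noise sample x0 and the data x1.
    """
    x0 = torch.randn_like(x1)                              # Gaussian noise endpoint
    # One random time per sample, broadcastable over the remaining dims.
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1                             # point on the straight path
    v_target = x1 - x0                                     # velocity of the linear path
    v_pred = model(xt, t, cond)                            # network's velocity estimate
    return F.mse_loss(v_pred, v_target)
```

Compared with standard diffusion noise-prediction objectives, this straight-line formulation is often credited with simpler training dynamics and fewer sampling steps, which is consistent with the efficiency motivation stated in the abstract.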