🤖 AI Summary
Existing diffusion-based methods for long-video depth estimation employ sliding-window strategies, leading to inter-window scale inconsistency and error accumulation; moreover, their exclusive reliance on 2D diffusion priors neglects the intrinsic 3D geometric structure of videos, causing geometric distortions in depth predictions. To address these issues, we propose a training-free diffusion-guided framework featuring dual guidance mechanisms: scale guidance, which enforces cross-window scale synchronization to suppress cumulative bias, and geometry guidance, which embeds 3D structural priors directly into the denoising process to ensure intra-window geometric alignment. Our approach significantly improves the scale consistency and geometric coherence of depth maps across long videos. Extensive evaluations on multiple benchmark datasets demonstrate substantial performance gains over existing windowed diffusion methods, achieving state-of-the-art results in both quantitative metrics and qualitative visual fidelity.
📝 Abstract
Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
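The abstract does not give the guidance equations, but the cross-window scale problem it describes can be made concrete with a minimal sketch. The snippet below (names, the toy 1D data, and the post-hoc formulation are illustrative assumptions, not the paper's implementation) solves a least-squares scale/shift on the frames shared by consecutive windows and applies it to the whole window; DepthSync applies an analogous scale-synchronization constraint as guidance *inside* the denoising loop, paired with a geometry term, rather than as post-processing like this.

```python
import numpy as np

def solve_scale_shift(src, ref):
    """Least-squares (s, t) minimizing ||s * src + t - ref||^2."""
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s, t

def synchronize_windows(windows, overlap):
    """Chain-align each window's depth scale to the previous one
    using the frames the two windows share."""
    aligned = [windows[0]]
    for cur in windows[1:]:
        ref = aligned[-1][-overlap:]   # trailing frames of previous window
        src = cur[:overlap]            # leading frames of current window
        s, t = solve_scale_shift(src, ref)
        aligned.append(s * cur + t)    # rescale the entire window
    return aligned

# Toy setup: 10 "frames" of 1-pixel depth, split into 4-frame
# windows with a 2-frame overlap; each window gets an arbitrary
# per-window scale/shift to mimic inter-window inconsistency.
rng = np.random.default_rng(0)
truth = np.linspace(1.0, 5.0, 10).reshape(10, 1)
wins = [truth[i:i + 4] for i in range(0, 8, 2)]
noisy = [w * rng.uniform(0.5, 2.0) + rng.uniform(-1.0, 1.0) for w in wins]
sync = synchronize_windows(noisy, overlap=2)
```

After synchronization the overlapping frames of consecutive windows agree, so the windows can be stitched without scale jumps; chaining alignments this way is exactly where post-hoc methods accumulate drift over many windows, which is the failure mode the in-loop guidance is meant to avoid.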