DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

📅 2025-07-02
🤖 AI Summary
Existing diffusion-based methods for long-video depth estimation process videos in sliding windows, which leads to scale inconsistency between windows and to error accumulation. Because they rely exclusively on 2D diffusion priors, they also neglect the intrinsic 3D geometric structure of videos, which causes geometric distortions in the predicted depth. To address these issues, we propose a training-free, diffusion-guided framework with two guidance mechanisms: scale guidance, which enforces cross-window scale synchronization to suppress cumulative bias, and geometry guidance, which embeds 3D structural priors directly into the denoising process to ensure geometric alignment within each window. Our approach significantly improves the scale consistency and geometric coherence of depth maps across long videos. Extensive evaluations on multiple benchmark datasets demonstrate substantial performance gains over existing windowed diffusion methods, achieving state-of-the-art results in both quantitative metrics and qualitative visual fidelity.

📝 Abstract
Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
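The scale discrepancy between windows can be made concrete with a toy example. The sketch below is not the paper's guidance procedure. It only illustrates the quantity such guidance would plausibly reduce: the depth gap on the frames shared by two consecutive windows, together with one gradient step on that gap. The function names, the squared-error loss, and the step size are assumptions introduced for illustration.

```python
import numpy as np


def overlap_scale_loss(curr_window, prev_window, n_overlap):
    """Half the summed squared depth gap on the frames shared by two windows.

    curr_window, prev_window: arrays of shape (frames, H, W).
    Assumption: the last n_overlap frames of prev_window depict the same
    video frames as the first n_overlap frames of curr_window.
    """
    diff = curr_window[:n_overlap] - prev_window[-n_overlap:]
    return 0.5 * float(np.sum(diff ** 2)), diff


def scale_guidance_step(curr_window, prev_window, n_overlap, step_size=0.5):
    """One gradient-descent step on the overlap loss (hypothetical update).

    In this toy version only the overlapping frames are moved; the
    remaining frames of curr_window are left unchanged.
    """
    _, diff = overlap_scale_loss(curr_window, prev_window, n_overlap)
    grad = np.zeros_like(curr_window)
    grad[:n_overlap] = diff  # derivative of the loss w.r.t. curr_window
    return curr_window - step_size * grad
```

Calling `scale_guidance_step` on a window whose depth is uniformly scaled relative to its predecessor pulls the shared frames toward the previous window's values. In an actual diffusion pipeline the same kind of gradient would be applied to intermediate denoising variables rather than to final depth maps, which this sketch does not model.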
Problem

Research questions and friction points this paper is trying to address.

Addressing scale discrepancies in long video depth estimation
Overcoming geometric inconsistency in video depth predictions
Enhancing depth consistency without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework using diffusion guidance
Scale guidance synchronizes depth scale across windows
Geometry guidance enforces 3D geometric alignment
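The summary does not spell out which 3D constraint the geometry guidance uses. As one common form of such a constraint, the sketch below lifts two depth maps to 3D points with a pinhole camera model and measures how far apart the points are after a rigid motion between the frames. The intrinsics `K`, the relative pose `(R, t)`, the pixel-aligned correspondence between the two frames, and both function names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def unproject(depth, K):
    """Lift a depth map (H, W) to camera-space points (H, W, 3).

    K is a 3x3 pinhole intrinsics matrix (assumed known).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column, row
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)


def geometry_loss(depth_i, depth_j, K, R, t):
    """Mean 3D distance between frame i's points moved into frame j
    and frame j's own points.

    Simplifying assumption: pixel (u, v) in frame i corresponds to
    pixel (u, v) in frame j. A real pipeline would need explicit
    correspondences or reprojection.
    """
    pts_i_in_j = unproject(depth_i, K) @ R.T + t
    pts_j = unproject(depth_j, K)
    return float(np.mean(np.linalg.norm(pts_i_in_j - pts_j, axis=-1)))
```

With an identity pose and identical depth maps the loss is zero, and it grows when one depth map is rescaled, so its gradient could in principle serve as a within-window guidance signal.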