🤖 AI Summary
Existing diffusion-based methods for long-video depth estimation employ sliding-window strategies, leading to inter-window scale inconsistency and error accumulation; moreover, their exclusive reliance on 2D diffusion priors neglects the intrinsic 3D geometric structure of videos, causing geometric distortions in depth predictions. To address these issues, we propose a training-free diffusion-guided framework featuring dual guidance mechanisms: scale guidance, which enforces cross-window scale synchronization to suppress cumulative bias, and geometry guidance, which embeds 3D structural priors directly into the denoising process to ensure intra-window geometric alignment. Our approach significantly improves the scale consistency and geometric coherence of depth maps across long videos. Extensive evaluations on multiple benchmark datasets demonstrate substantial performance gains over existing windowed diffusion methods, achieving state-of-the-art results in both quantitative metrics and qualitative visual fidelity.
📝 Abstract
Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
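The abstract does not give the guidance equations, but the cross-window scale problem it describes can be made concrete with a minimal sketch. The snippet below (names, the toy 1D data, and the post-hoc formulation are illustrative assumptions, not the paper's implementation) solves a least-squares scale/shift on the frames shared by consecutive windows and applies it to the whole window; DepthSync applies an analogous scale-synchronization constraint as guidance *inside* the denoising loop, paired with a geometry term, rather than as post-processing like this.

```python
import numpy as np

def solve_scale_shift(src, ref):
    """Least-squares (s, t) minimizing ||s * src + t - ref||^2."""
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s, t

def synchronize_windows(windows, overlap):
    """Chain-align each window's depth scale to the previous one
    using the frames the two windows share."""
    aligned = [windows[0]]
    for cur in windows[1:]:
        ref = aligned[-1][-overlap:]   # trailing frames of previous window
        src = cur[:overlap]            # leading frames of current window
        s, t = solve_scale_shift(src, ref)
        aligned.append(s * cur + t)    # rescale the entire window
    return aligned

# Toy setup: 10 "frames" of 1-pixel depth, split into 4-frame
# windows with a 2-frame overlap; each window gets an arbitrary
# per-window scale/shift to mimic inter-window inconsistency.
rng = np.random.default_rng(0)
truth = np.linspace(1.0, 5.0, 10).reshape(10, 1)
wins = [truth[i:i + 4] for i in range(0, 8, 2)]
noisy = [w * rng.uniform(0.5, 2.0) + rng.uniform(-1.0, 1.0) for w in wins]
sync = synchronize_windows(noisy, overlap=2)
```

After synchronization the overlapping frames of consecutive windows agree, so the windows can be stitched without scale jumps; chaining alignments this way is exactly where post-hoc methods accumulate drift over many windows, which is the failure mode the in-loop guidance is meant to avoid.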