GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing video depth estimation methods: spatial blurring in fine-detail regions and difficulty maintaining strict 3D geometric consistency under rotations or large viewpoint changes. The authors propose GemDepth, a framework that combines explicit awareness of camera motion with global 3D structure. A Geometry-Embedding Module (GEM) predicts inter-frame camera poses and turns them into implicit geometric embeddings, and an Alternating Spatio-Temporal Transformer (ASTT), guided by these embeddings, captures latent point-level correspondences to sharpen spatial detail while enforcing temporal consistency. The authors present this injection of motion priors as the distinguishing feature of the method, complemented by a data-efficient training strategy. They report state-of-the-art performance across multiple datasets, particularly in complex dynamic scenes, while balancing efficiency and geometric consistency.
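The summary does not spell out how ASTT is built, so the following is only a minimal sketch of the general pattern its name suggests: attention applied within each frame (spatial), then across frames at each token position (temporal), in alternation. It assumes projection-free self-attention (Q = K = V) and omits the geometric embeddings entirely; the function names `attend`, `spatial_step`, `temporal_step`, and `alternating_blocks` are hypothetical and not taken from the GemDepth code.

```python
import math


def attend(seq):
    """Projection-free softmax self-attention over a list of vectors (Q = K = V)."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, seq))
                    for d in range(dim)])
    return out


def residual(x, y):
    """Element-wise residual connection."""
    return [a + b for a, b in zip(x, y)]


def spatial_step(x):
    """x[t][n] is the feature of token n in frame t; mix tokens within each frame."""
    return [[residual(tok, mixed) for tok, mixed in zip(frame, attend(frame))]
            for frame in x]


def temporal_step(x):
    """Mix each token position across frames."""
    num_frames, num_tokens = len(x), len(x[0])
    out = [[None] * num_tokens for _ in range(num_frames)]
    for n in range(num_tokens):
        track = [x[t][n] for t in range(num_frames)]
        for t, mixed in enumerate(attend(track)):
            out[t][n] = residual(track[t], mixed)
    return out


def alternating_blocks(x, depth=1):
    """Alternate spatial and temporal attention `depth` times (assumed ordering)."""
    for _ in range(depth):
        x = temporal_step(spatial_step(x))
    return x


# Two frames, one token each: only the temporal step can exchange information.
clip = [[[1.0, 0.0]], [[0.0, 1.0]]]
mixed_clip = alternating_blocks(clip)
```

In this toy input the spatial step alone leaves each frame's feature untouched apart from the residual, while the temporal step lets frame 0 pick up a component from frame 1. That cross-frame exchange is what the summary credits for temporal consistency.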

📝 Abstract

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth
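To make "3D geometric consistency" concrete, the sketch below shows the standard pinhole-camera check that links depth and inter-frame pose: a pixel in frame a is lifted to 3D with its depth, moved into frame b by the relative pose, and reprojected. The depth predicted in frame b at that location should match the transformed depth. This is a generic illustration, not GemDepth's loss or evaluation protocol; the function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the convention that `(R, t)` maps frame-a coordinates to frame-b coordinates are all assumptions made for the example.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with its depth to a 3D point in the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(R, t, p):
    """Rigid transform p' = R p + t, with R given as a 3x3 nested list."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates; also return its depth."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy, z)


def warp_pixel(u, v, depth_a, R_ab, t_ab, intrinsics):
    """Where a frame-a pixel lands in frame b, and the depth it should have there."""
    point_a = backproject(u, v, depth_a, *intrinsics)
    point_b = transform(R_ab, t_ab, point_a)
    return project(point_b, *intrinsics)


def relative_depth_error(expected_depth, predicted_depth):
    """Relative disagreement between the warped depth and frame b's prediction."""
    return abs(expected_depth - predicted_depth) / expected_depth


# Example: the camera moves 0.5 units forward, so points get 0.5 closer in frame b.
intrinsics = (500.0, 500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u_b, v_b, z_b = warp_pixel(320.0, 240.0, 2.0, identity, (0.0, 0.0, -0.5), intrinsics)
error = relative_depth_error(z_b, 1.8)  # frame b predicted 1.8 instead of 1.5
```

A purely temporal smoother can make consecutive depth maps look similar without driving this error down, because the check depends on the camera pose. That dependence is why the abstract treats explicit awareness of camera motion as a prerequisite for 3D consistency.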

Problem

Research questions and friction points this paper is trying to address.

video depth estimation

3D geometric consistency

temporal consistency

spatial blurring

camera motion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-Embedding Module

3D-consistent depth estimation

camera pose prediction