Towards Consistent Video Geometry Estimation

📅 2026-05-28
🤖 AI Summary
This work addresses the recovery of spatially dense and temporally coherent geometry (depth, surface normals, and point maps) from video sequences. It proposes ViGeo, a feed-forward foundation model built on a plain Transformer architecture. Its dynamic chunking attention mechanism exposes the model to both causal and bidirectional temporal contexts during training, so a single model supports streaming, full-sequence, and long-video inference without retraining. A video depth completion teacher, conditioned on sparse and noisy annotations and exploiting video/multi-view context, produces dense, geometrically consistent supervision signals, and the model jointly predicts depth, normals, and point maps. Trained exclusively on publicly available datasets, ViGeo achieves state-of-the-art performance across multiple geometric estimation tasks under online, offline, and long-video evaluation settings.
📝 Abstract
This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
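The abstract does not spell out how dynamic chunking attention is implemented. As a rough illustration only, one plausible reading is a chunk-wise (block-causal) mask: frames attend bidirectionally within their own chunk and causally to all earlier chunks. The function name `chunked_attention_mask`, its parameters, and the frame-level granularity below are assumptions for this sketch, not the authors' code.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """Frame-level attention mask (hypothetical sketch, not ViGeo's implementation).

    mask[i][j] is True when frame i may attend to frame j:
    frames see every frame in their own chunk (bidirectional)
    and every frame in earlier chunks (causal across chunks).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk each frame belongs to.
    chunk_of = [t // chunk_size for t in range(num_frames)]
    return [
        [chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


# chunk_size == 1 reduces to a purely causal (streaming-style) mask.
streaming = chunked_attention_mask(4, 1)
# chunk_size == num_frames reduces to full bidirectional attention.
offline = chunked_attention_mask(4, 4)
# Intermediate sizes mix both: bidirectional inside a chunk, causal across chunks.
mixed = chunked_attention_mask(4, 2)
```

Under this assumed formulation, varying `chunk_size` during training would expose one set of weights to both causal and bidirectional contexts, and picking a chunk size at test time would select the inference mode without retraining.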
Problem

Research questions and friction points this paper is trying to address.

video geometry estimation
temporal consistency
depth estimation
surface normals
point map estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic chunking attention
video depth completion
temporally consistent geometry
foundation model
surface normal estimation