Towards Consistent Video Geometry Estimation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the recovery of spatially dense and temporally coherent geometry (depth, surface normals, and point maps) from video sequences. To this end, we propose ViGeo, a feed-forward foundation model built on a plain Transformer architecture. Its dynamic chunking attention mechanism unifies causal and bidirectional temporal modeling, enabling flexible inference in streaming, full-sequence, and long-video scenarios. To improve supervision, a video depth completion teacher model exploits video and multi-view context to turn sparse, noisy annotations into dense, geometrically consistent training targets. The same model jointly predicts depth, normals, and point maps. Trained exclusively on publicly available datasets, ViGeo achieves state-of-the-art performance across multiple geometry estimation tasks under online, offline, and long-video evaluation settings.

📝 Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
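The abstract does not spell out how dynamic chunking attention is implemented, so the following is only a minimal sketch of one plausible reading: frames are grouped into chunks, attention is bidirectional inside a chunk and causal across chunks, and the chunk size is resampled during training. The function names `chunked_attention_mask` and `sample_chunk_size`, and the uniform sampling rule, are hypothetical and not taken from the paper.

```python
import random


def chunked_attention_mask(num_frames, chunk_size):
    """Build a frame-level attention mask (assumed scheme, not the paper's code).

    mask[q][k] is True when query frame q may attend to key frame k.
    Frames in the same chunk see each other (bidirectional); a frame also
    sees every frame in earlier chunks (causal across chunks).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        # A key is visible if its chunk is the query's chunk or an earlier one.
        mask.append([(k // chunk_size) <= q_chunk for k in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, rng):
    """Pick a chunk size uniformly in [1, num_frames] (hypothetical rule)."""
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    rng = random.Random(0)
    size = sample_chunk_size(6, rng)
    for row in chunked_attention_mask(6, size):
        print("".join("x" if visible else "." for visible in row))
```

Under this reading, a chunk size of 1 gives a purely causal mask suited to streaming, a chunk size equal to the sequence length gives full bidirectional attention for offline inference, and intermediate sizes correspond to processing a long video chunk by chunk. Varying the chunk size during training would be one way to expose a single model to all three patterns.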

Problem

Research questions and friction points this paper is trying to address.

video geometry estimation

temporal consistency

depth estimation

surface normals

point map estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic chunking attention

video depth completion

temporally consistent geometry