🤖 AI Summary
In dynamic scene geometry estimation, the heterogeneity of spatial and temporal features leads to representation mismatch when both are forced into a unified latent space. To address this, we propose 4D-VGGT, a spatiotemporally aware general foundation model for dynamic geometry reconstruction. Its core innovation is a divide-and-conquer spatiotemporal representation: (i) an adaptive visual grid that encodes input sequences with arbitrary numbers of views and time steps; (ii) cross-view global fusion and cross-time local fusion that decouple spatial from temporal modeling; and (iii) hierarchical spatiotemporal feature fusion feeding multiple task-specific prediction heads. Trained jointly on large-scale geometry datasets, 4D-VGGT achieves strong performance across multiple dynamic scene geometry benchmarks, with improvements in reconstruction accuracy, feature discriminability, and cross-task generalization.
📝 Abstract
We investigate the challenging task of dynamic scene geometry estimation, which requires representing both spatial and temporal features. Existing methods typically align the two kinds of features into a unified latent space to model scene geometry. However, this unified paradigm suffers from representation mismatch due to the heterogeneous nature of spatial and temporal features. In this work, we propose 4D-VGGT, a general foundation model with a divide-and-conquer spatiotemporal representation for dynamic scene geometry. Our model comprises three aspects: 1) Multi-setting input. We design an adaptive visual grid that supports input sequences with arbitrary numbers of views and time steps. 2) Multi-level representation. We propose cross-view global fusion for spatial representation and cross-time local fusion for temporal representation. 3) Multi-task prediction. We attach multiple task-specific heads to the spatiotemporal representations, enabling comprehensive visual geometry estimation for dynamic scenes. Under this unified framework, these components improve the feature discriminability of our model and broaden its applicability to dynamic scenes. In addition, we aggregate multiple geometry datasets to train our model and conduct extensive experiments verifying the effectiveness of our method across various tasks on multiple dynamic scene geometry benchmarks.
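To make the divide-and-conquer idea concrete, the sketch below separates the two fusion paths the abstract describes: a spatial path where, at each time step, tokens attend globally across all views, and a temporal path where each token attends only within a local temporal window. This is a minimal numpy illustration under our own assumptions; the function names, tensor layout `(T, V, P, D)` (time, views, patches, channels), window size, and the final concatenation stand-in for hierarchical fusion are all hypothetical, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention over the last two axes.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_view_global_fusion(tokens):
    # Spatial path: at each time step, flatten all views' patch tokens
    # together so every token attends globally across views.
    T, V, P, D = tokens.shape
    x = tokens.reshape(T, V * P, D)
    x = attention(x, x, x)
    return x.reshape(T, V, P, D)

def cross_time_local_fusion(tokens, window=3):
    # Temporal path: each (view, patch) token attends only to the same
    # token within a local temporal window (illustrative window size).
    T, V, P, D = tokens.shape
    seq = tokens.transpose(1, 2, 0, 3)      # (V, P, T, D)
    fused = np.empty_like(seq)
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        q = seq[:, :, t:t + 1, :]           # query: current time step
        k = seq[:, :, lo:hi, :]             # keys/values: local window
        fused[:, :, t, :] = attention(q, k, k)[:, :, 0, :]
    return fused.transpose(2, 0, 1, 3)      # back to (T, V, P, D)

# Toy sequence: 4 time steps, 2 views, 6 patch tokens, 8 channels.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 2, 6, 8))
spatial = cross_view_global_fusion(tokens)
temporal = cross_time_local_fusion(tokens)
# Channel concatenation as a stand-in for hierarchical fusion,
# after which task-specific heads would consume the features.
fused = np.concatenate([spatial, temporal], axis=-1)
print(fused.shape)  # (4, 2, 6, 16)
```

The point of the separation is that the spatial path never mixes time steps and the temporal path never mixes views, so the two heterogeneous feature types are modeled independently before being fused for the prediction heads.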