4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing 3D foundation models in dynamic 4D reconstruction from monocular video: camera ego-motion and object motion are coupled within global attention mechanisms, which degrades performance in dynamic scenes. The authors propose a training-free, progressive decoupling framework that separates static from dynamic content in a coarse-to-fine manner, first stabilizing the camera pose and then refining geometry. The approach combines three components: dynamic-mask-guided pose decoupling, an orthogonal decomposition of the depth manifold that preserves dynamic objects while refining static regions, and a confidence-aware fusion step that treats depth integration as heteroscedastic Bayesian inference and blends multi-pass predictions via inverse-variance weighting. On standard 4D reconstruction benchmarks, the method reports consistent and substantial improvements across principal point-cloud metrics without requiring fine-tuning.

📝 Abstract

Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
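
The inverse-variance weighting named in component (3) can be illustrated with a minimal sketch. This is generic per-pixel inverse-variance fusion under a Gaussian noise assumption, not the authors' implementation: the function name `fuse_inverse_variance`, the `(num_passes, H, W)` array layout, and the `eps` stabilizer are all assumptions made for illustration, and the abstract does not say how the per-pass variances are obtained.

```python
import numpy as np

def fuse_inverse_variance(depths, variances, eps=1e-8):
    """Fuse depth maps from several passes by inverse-variance weighting.

    depths, variances: array-likes of shape (num_passes, H, W), where
    variances[i] is the (assumed known) per-pixel predictive variance of pass i.
    Returns the fused depth and the fused variance, each of shape (H, W).
    """
    depths = np.asarray(depths, dtype=np.float64)
    variances = np.asarray(variances, dtype=np.float64)

    # Each pass is weighted by its precision (1 / variance); eps avoids
    # division by zero and is a hypothetical choice, not from the paper.
    weights = 1.0 / (variances + eps)
    weight_sum = weights.sum(axis=0)

    # Precision-weighted mean: confident (low-variance) passes dominate.
    fused_depth = (weights * depths).sum(axis=0) / weight_sum
    # For independent Gaussian estimates, the fused variance is the
    # inverse of the summed precisions.
    fused_variance = 1.0 / weight_sum
    return fused_depth, fused_variance


# Toy usage: two passes over a 1x2 image; the second pass is less certain
# at the right-hand pixel, so the fused value there leans toward pass one.
depth_passes = np.array([[[2.0, 2.0]], [[4.0, 4.0]]])
variance_passes = np.array([[[1.0, 1.0]], [[1.0, 3.0]]])
fused, fused_var = fuse_inverse_variance(depth_passes, variance_passes)
```

Because the weights vary per pixel, the same pass can dominate in one region and be down-weighted in another, which is what makes the noise model heteroscedastic rather than a single global blend factor.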

Problem

Research questions and friction points this paper is trying to address.

4D reconstruction

dynamic scenes

monocular video

camera ego-motion

object motion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic-Static Disentanglement

Pose Decoupling

Topological Subspace Surgery