PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Dynamic scenes pose a fundamental conflict between camera pose estimation—which requires suppressing dynamic regions—and geometric reconstruction—which necessitates modeling dynamic elements. To resolve this, we propose the Dynamic-Aware Visual Geometric Grounding Transformer (DVG-Transformer). Our method introduces a dynamic-aware aggregation module and a dynamic mask prediction mechanism to decouple static and dynamic features: motion is suppressed during pose estimation to mitigate interference, while motion cues are explicitly leveraged for depth and point cloud reconstruction. We formulate an end-to-end multi-task framework supporting both monocular and video inputs. Experiments on 4D dynamic-scene reconstruction demonstrate that DVG-Transformer significantly outperforms baselines such as VGGT, enabling real-time, post-processing-free depth prediction, dense point cloud reconstruction, and high-accuracy camera localization—achieving a 23.6% reduction in pose error and an 18.4% improvement in depth accuracy.

Technology Category

Application Category

📝 Abstract
Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
Problem

Research questions and friction points this paper is trying to address.

Extends 3D models to handle dynamic scenes with moving objects
Resolves conflict between pose estimation and geometry reconstruction tasks
Disentangles static and dynamic information using dynamics-aware masking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends VGGT to dynamic scenes for 4D perception
Uses dynamics-aware mask to disentangle static and dynamic information
Enables pose estimation and geometry reconstruction without post-processing
🔎 Similar Papers
No similar papers found.
K
Kaichen Zhou
MIT
Y
Yuhan Wang
Imperial College London
G
Grace Chen
Harvard University
X
Xinhai Chang
MIT
G
Gaspard Beaudouin
École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris
Fangneng Zhan
Fangneng Zhan
MIT
Neural RenderingGenerative Models
P
Paul Pu Liang
MIT
M
Mengyu Wang
Harvard University