🤖 AI Summary
Existing point-based 3D reconstruction methods (e.g., DUSt3R) suffer significant performance degradation in dynamic scenes due to object motion. To address this, we propose the first end-to-end 4D point-cloud modeling framework that explicitly incorporates the temporal dimension to jointly regress static and dynamic geometric structures. Our method comprises: (1) spatiotemporal joint feature encoding; (2) feedforward regression of 4D dense correspondences across frames; and (3) dynamic-static decoupled representation learning. Evaluated on multiple real-world and synthetic datasets with complex motion, our approach consistently outperforms baselines—including DUSt3R—in reconstruction accuracy, motion segmentation consistency, and cross-frame robustness. Notably, it achieves single-pass, feedforward 4D reconstruction for dynamic scenes without requiring motion priors or iterative optimization—marking the first such capability in the literature.
📝 Abstract
We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of prior 3D pointmap regression methods such as DUSt3R, which were originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D^2USt3R, which regresses 4D pointmaps that simultaneously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach encapsulates spatio-temporal dense correspondence within the proposed 4D pointmaps, benefiting downstream tasks. Extensive experimental evaluations demonstrate that our approach consistently achieves superior reconstruction performance across various datasets featuring complex motions.
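To make the described pipeline concrete, below is a minimal sketch, not the paper's released implementation, of a feed-forward two-frame pointmap regressor with a dynamic/static head. The module name, layer sizes, and tensor shapes are illustrative assumptions; it only mirrors the high-level idea of joint encoding of two frames, single-pass regression of per-pixel 3D points in a common reference frame, and a per-pixel dynamic-motion prediction.

```python
# Hypothetical sketch (not the D^2USt3R code): a toy feed-forward two-frame
# pointmap regressor in PyTorch. Two frames are encoded with a shared encoder,
# fused jointly, and per-pixel 3D points plus a dynamic-vs-static logit are
# regressed in one forward pass.
import torch
import torch.nn as nn


class Toy4DPointmapRegressor(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Shared per-frame convolutional encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        # Joint spatio-temporal fusion over the concatenated frame features.
        self.fusion = nn.Conv2d(2 * dim, dim, 3, padding=1)
        # Heads: 3D point (x, y, z) per pixel for each frame, expressed in a
        # common reference frame, plus a per-pixel dynamic-motion logit.
        self.point_head = nn.Conv2d(dim, 6, 1)   # 3 coordinates per frame
        self.motion_head = nn.Conv2d(dim, 1, 1)  # dynamic mask logit

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor):
        feat_t = self.encoder(frame_t)
        feat_t1 = self.encoder(frame_t1)
        fused = torch.relu(self.fusion(torch.cat([feat_t, feat_t1], dim=1)))
        points = self.point_head(fused)           # (B, 6, H, W)
        dyn_logit = self.motion_head(fused)       # (B, 1, H, W)
        pts_t, pts_t1 = points[:, :3], points[:, 3:]
        return pts_t, pts_t1, dyn_logit


if __name__ == "__main__":
    model = Toy4DPointmapRegressor()
    a = torch.randn(1, 3, 64, 64)  # frame at time t
    b = torch.randn(1, 3, 64, 64)  # frame at time t+1
    pts_t, pts_t1, dyn = model(a, b)
    print(pts_t.shape, pts_t1.shape, dyn.shape)
```

Because both frames' pointmaps are predicted in a shared reference frame, pixels whose predicted 3D points move between frames can be flagged by the dynamic head, which is the intuition behind decoupling static and dynamic geometry in a single feed-forward pass.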