🤖 AI Summary
Existing static 3D reconstruction methods struggle to model the time-varying nature of dynamic scenes in autonomous driving. To address this limitation, this work proposes DynamicVGGT, a novel framework that, for the first time, achieves dynamic 4D reconstruction within a feed-forward 3D architecture. The approach jointly predicts current and future point maps in a shared reference coordinate system, implicitly learning dynamic point representations. It further introduces a Motion-aware Temporal Attention (MTA) module and a dynamic 3D Gaussian splatting head driven by learnable motion tokens to explicitly model point-wise motion and enforce temporal consistency. Combined with scene flow supervision and continuous Gaussian optimization, the method significantly outperforms existing approaches on multiple autonomous driving benchmarks, delivering accurate and robust feed-forward reconstruction of dynamic scenes.
📝 Abstract
Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance on static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities from learnable motion tokens under scene flow supervision, and refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.
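The two dynamic components described above, temporal attention over per-point features across frames and explicit motion modeling via predicted Gaussian velocities supervised by scene flow, can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification for intuition only: the function names, shapes, and the single-head attention are assumptions, not the paper's actual architecture, which operates inside a learned transformer with motion tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(feats):
    # feats: (T, N, C) per-frame features for N tracked points.
    # Each point attends to its own feature across the T frames,
    # a toy stand-in for the motion-aware temporal attention (MTA)
    # that lets the model learn motion continuity over time.
    T, N, C = feats.shape
    q = feats.transpose(1, 0, 2)                      # (N, T, C)
    scores = q @ q.transpose(0, 2, 1) / np.sqrt(C)    # (N, T, T)
    return (softmax(scores) @ q).transpose(1, 0, 2)   # (T, N, C)

def advect_gaussians(centers_t, velocities, dt=1.0):
    # Explicit point-wise motion: displace Gaussian centers by a
    # predicted per-point velocity. In the paper these velocities are
    # regressed from motion tokens and supervised by scene flow; here
    # we just apply the displacement.
    return centers_t + dt * velocities

# Toy example: 4 points in a shared reference frame, rigidly translating.
pts_t  = np.array([[0., 0., 2.], [1., 0., 2.], [0., 1., 3.], [1., 1., 3.]])
pts_t1 = pts_t + np.array([0.5, 0.0, 0.1])  # next-frame point map
vel    = pts_t1 - pts_t                     # scene-flow-style supervision target
pred   = advect_gaussians(pts_t, vel)       # advected centers match pts_t1

feats = np.random.default_rng(0).normal(size=(3, 4, 8))  # (T=3, N=4, C=8)
out = temporal_attention(feats)
print(pred.shape, out.shape)
```

Because current and future point maps share one reference coordinate system, their per-point difference directly serves as a scene-flow target for the velocity head, which is the correspondence the joint prediction exploits.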