🤖 AI Summary
This work addresses the scarcity of high-quality, densely annotated datasets with complete geometric ground truth for dense 3D reconstruction and tracking of dynamic scenes from monocular video. To bridge this gap, the authors introduce Syn4D, a multi-view synthetic dataset of dynamic scenes that provides pixel-level, spatiotemporally consistent 4D ground truth. The dataset includes accurate camera poses, depth maps, dense trajectories, and parametric human body poses, so that any pixel can be back-projected into 3D and mapped to any time and any viewpoint. It combines synthetic data generation, multi-view rendering, 3D back-projection, dense optical flow, and articulated human modeling. Evaluations on 4D reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation demonstrate the dataset's utility and help fill the gap in high-fidelity annotations for dynamic scenes.
📝 Abstract
Dense 3D reconstruction and tracking of dynamic scenes from monocular video remains an important open challenge in computer vision. Progress in this area has been constrained by the scarcity of high-quality datasets with dense, complete, and accurate geometric annotations. To address this limitation, we introduce Syn4D, a multi-view synthetic dataset of dynamic scenes that includes ground-truth camera motion, depth maps, dense tracking, and parametric human pose annotations. A key feature of Syn4D is the ability to unproject any pixel into 3D and map it to any time and any camera. To demonstrate the utility and effectiveness of the dataset, we conduct extensive evaluations across multiple downstream tasks, including 4D scene reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation. The experimental results highlight Syn4D's potential to facilitate research in dynamic scene understanding and spatiotemporal modeling.
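To make the unprojection property concrete, the following is a minimal sketch of lifting a pixel to 3D from its depth and camera pose, then reprojecting it into a second camera. It is not Syn4D's API or file format: the `Camera` class, the pinhole model, the z-depth convention, and the camera-to-world pose convention are all assumptions chosen for illustration. The sketch handles a point at a single time instant only; mapping a moving point to another time would additionally require its annotated 3D trajectory.

```python
# Hypothetical sketch: pinhole unprojection and reprojection.
# Names and conventions are illustrative assumptions, not the dataset's API.
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Camera:
    fx: float             # focal length in pixels (x)
    fy: float             # focal length in pixels (y)
    cx: float             # principal point (x)
    cy: float             # principal point (y)
    R: List[List[float]]  # assumed camera-to-world rotation, 3x3 row-major
    t: Vec3               # assumed camera centre in world coordinates


def unproject(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Lift pixel (u, v) with z-depth `depth` to a world-space point."""
    # Pixel -> camera frame under the pinhole model.
    p = ((u - cam.cx) / cam.fx * depth, (v - cam.cy) / cam.fy * depth, depth)
    # Camera frame -> world frame: X = R p + t.
    return tuple(
        sum(cam.R[i][j] * p[j] for j in range(3)) + cam.t[i] for i in range(3)
    )


def project(cam: Camera, X: Vec3) -> Tuple[float, float, float]:
    """Project world point X into `cam`; returns (u, v, z-depth)."""
    # World frame -> camera frame: p = R^T (X - t).
    d = tuple(X[i] - cam.t[i] for i in range(3))
    p = tuple(sum(cam.R[j][i] * d[j] for j in range(3)) for i in range(3))
    # No check for points behind the camera (p[2] <= 0) in this sketch.
    return (cam.fx * p[0] / p[2] + cam.cx, cam.fy * p[1] / p[2] + cam.cy, p[2])


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_a = Camera(500.0, 500.0, 320.0, 240.0, IDENTITY, (0.0, 0.0, 0.0))
cam_b = Camera(500.0, 500.0, 320.0, 240.0, IDENTITY, (1.0, 0.0, 0.0))

# Centre pixel of camera A at depth 2 lies on A's optical axis.
point_world = unproject(cam_a, 320.0, 240.0, 2.0)
# Camera B sits one unit to the right, so the point appears shifted left.
pixel_in_b = project(cam_b, point_world)
```

With per-pixel depth and per-frame poses available as ground truth, applying the same two steps to every pixel would yield dense cross-view correspondences under these assumptions.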