🤖 AI Summary
Existing approaches struggle to handle dynamic scenes, multi-view inputs, and unknown camera poses together in a single feed-forward pass. This work proposes NoPo4D, the first feed-forward system that reconstructs dynamic 3D scenes from multi-view video without requiring known camera poses. Its key innovations are combining a pretrained geometry backbone with a 4D Gaussian representation, decomposing Gaussian motion into per-pixel image-plane displacement and depth change, and adding a bidirectional motion encoder for cross-view and cross-frame feature aggregation. The image-plane component is supervised directly by pseudo ground-truth optical flow, which avoids reliance on differentiable rendering or ground-truth 3D motion. On four dynamic multi-view benchmarks, NoPo4D consistently outperforms existing feed-forward methods; with an optional post-optimization stage, it also surpasses per-scene optimization approaches while running orders of magnitude faster.
📝 Abstract
Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps, but at a cost of minutes to hours per scene. We introduce NoPo4D, the first feed-forward system that fills this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and by view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage it surpasses per-scene optimization methods, while running orders of magnitude faster.
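To make the velocity decomposition concrete, the following is a minimal sketch of how a per-pixel image-plane shift and a depth change could be recombined into a 3D displacement in the camera frame. It is an illustration under stated assumptions, not the paper's implementation: it assumes a simple pinhole camera with known intrinsics, and the names `Intrinsics`, `unproject`, and `compose_velocity` are hypothetical. The abstract does not specify NoPo4D's actual parameterization, which may differ.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Hypothetical pinhole intrinsics (focal lengths and principal point, in pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject(u, v, depth, k):
    """Back-project pixel (u, v) at the given depth to a camera-frame 3D point."""
    x = (u - k.cx) / k.fx * depth
    y = (v - k.cy) / k.fy * depth
    return (x, y, depth)


def compose_velocity(u, v, depth, flow_uv, depth_delta, k):
    """Combine a 2D image-plane shift and a depth change into a 3D displacement.

    Assumption: the displacement is the difference between the back-projected
    point after the shift and the back-projected point before it.
    """
    p0 = unproject(u, v, depth, k)
    p1 = unproject(u + flow_uv[0], v + flow_uv[1], depth + depth_delta, k)
    return tuple(b - a for a, b in zip(p0, p1))


if __name__ == "__main__":
    k = Intrinsics(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
    # A pixel at the principal point that shifts 10 px to the right at constant depth.
    print(compose_velocity(50.0, 50.0, 2.0, (10.0, 0.0), 0.0, k))
```

In this sketch the `flow_uv` term is the only part that an optical-flow estimate can be compared against, which is consistent with the abstract's point that the 2D component can be supervised directly by pseudo ground-truth flow, while the depth change has to be constrained some other way.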