🤖 AI Summary
Efficient and consistent reconstruction of dense 4D dynamic scenes from pose-free image pairs remains challenging. This work proposes UFO-4D, a unified feedforward framework that jointly estimates dynamic 3D Gaussian splats, camera poses, and motion from only two unposed images. It is the first method to achieve self-supervised, multi-signal 4D reconstruction from a single dynamic 3D Gaussian representation, coherently coupling appearance, depth, and motion through differentiable rendering and image synthesis losses. Experiments demonstrate up to a threefold improvement in joint estimation accuracy for geometry, motion, and camera pose, while enabling high-quality novel view synthesis and temporal interpolation.
📝 Abstract
Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework that reconstructs a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3× in joint geometry, motion, and camera pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: https://ufo-4d.github.io/
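To give intuition for the self-supervised image synthesis loss described above, here is a deliberately simplified sketch. It is not UFO-4D's implementation: instead of rendering dynamic 3D Gaussians, it reduces the idea to its 2D essence, synthesizing one frame from another via a candidate motion field and scoring it photometrically. A motion estimate that explains the second image well yields a low loss; a wrong one yields a high loss, which is the training signal the abstract refers to. All names (`backward_warp`, `photometric_loss`) are hypothetical.

```python
import numpy as np

def backward_warp(img, flow):
    """Synthesize a view of `img` by sampling it at displaced coordinates.
    img: (H, W) grayscale frame; flow: (H, W, 2) per-pixel (dy, dx) offsets.
    Nearest-neighbour sampling with border clamping keeps the toy example simple."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return img[sy, sx]

def photometric_loss(target, synthesized):
    """Mean absolute photometric error between the real and synthesized frame."""
    return float(np.abs(target - synthesized).mean())

# Toy "scene": frame2 is frame1 shifted 3 pixels to the right.
rng = np.random.default_rng(0)
frame1 = rng.random((32, 32))
frame2 = np.roll(frame1, 3, axis=1)

# A correct motion hypothesis (sample frame2 three pixels to the right)
# versus an incorrect one (no motion at all).
good_flow = np.zeros((32, 32, 2)); good_flow[..., 1] = 3.0
bad_flow = np.zeros((32, 32, 2))

loss_good = photometric_loss(frame1, backward_warp(frame2, good_flow))
loss_bad = photometric_loss(frame1, backward_warp(frame2, bad_flow))
# loss_good is near zero (up to border clamping); loss_bad is much larger.
```

In a differentiable framework, gradients of this loss flow back into the predicted motion, depth, and pose jointly, which is why supervising the synthesized image alone can regularize all three, as the abstract argues for the shared Gaussian primitives.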