🤖 AI Summary
This work addresses the coupling of geometry and motion in dynamic 3D scenes by proposing a unified 4D reconstruction and tracking method centered on camera-space scene flow. Built on a Vision Transformer, the approach processes two input views symmetrically through a shared decoder and jointly predicts per-pixel 3D geometry, bidirectional scene flow, pose weights, and confidence scores in a single forward pass, with no explicit pose regression or bundle adjustment. Trained jointly on static and dynamic datasets, the model handles static and dynamic scene elements uniformly and achieves state-of-the-art performance in 4D reconstruction and tracking, supporting a scene-flow-centric representation for spatiotemporal scene understanding.
📝 Abstract
Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-centric representation for spatiotemporal scene understanding.
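The abstract does not spell out how the per-pixel outputs replace a pose regressor, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that adding the camera-space scene flow to a 3D point gives that point's position in the second camera's frame, and that the pose weights mark pixels whose flow is explained by camera motion alone. Under those assumptions, a weighted rigid alignment (the standard weighted Kabsch/Procrustes solution) recovers the relative camera motion, and whatever flow remains can be read as object motion. The function names `weighted_rigid_fit` and `residual_flow` are hypothetical.

```python
import numpy as np


def weighted_rigid_fit(points, flow, weights):
    """Weighted Kabsch fit.

    Finds R, t minimising sum_i w_i * ||R p_i + t - (p_i + f_i)||^2.
    points, flow: arrays of shape (..., 3); weights: matching leading shape.
    Assumption: p_i + f_i is the point expressed in the second camera frame.
    """
    src = np.asarray(points, dtype=float).reshape(-1, 3)
    dst = src + np.asarray(flow, dtype=float).reshape(-1, 3)
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    w = w / w.sum()

    # Weighted centroids of the source and flowed point sets.
    src_mean = (w * src).sum(axis=0)
    dst_mean = (w * dst).sum(axis=0)

    # Weighted 3x3 cross-covariance, then SVD.
    H = (w * (src - src_mean)).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)

    # Guard against a reflection so that det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


def residual_flow(points, flow, R, t):
    """Flow left over after removing the rigid (camera-induced) part."""
    p = np.asarray(points, dtype=float)
    rigid = p @ R.T + t - p
    return np.asarray(flow, dtype=float) - rigid


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(8, 8, 3)) + np.array([0.0, 0.0, 5.0])
    flw = np.tile(np.array([0.05, 0.0, -0.1]), (8, 8, 1))  # pure translation
    R, t = weighted_rigid_fit(pts, flw, np.ones((8, 8)))
    print(R, t, np.abs(residual_flow(pts, flw, R, t)).max())
```

Pixels with near-zero weight barely influence the fit, so in this sketch a moving object would not bias the recovered camera motion as long as its pixels are down-weighted.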