🤖 AI Summary
This work addresses the challenge of scaling feed-forward 3D/4D reconstruction to dynamic real-world scenes, where existing methods depend on dense geometry and pose annotations that are costly to obtain. The authors propose Flow3r, a visual geometry learning framework that trains on unlabeled monocular videos by using dense 2D optical flow as a supervision signal. Its flow prediction module is factored: flow between two images is predicted from the geometry latents of one image and the pose latents of the other, so the flow loss directly guides the learning of both scene geometry and camera motion and extends naturally to dynamic scenes. Trained with approximately 800,000 unlabeled videos, the method achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos, where labeled data is scarcest.
📝 Abstract
Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
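The intuition behind factoring flow into geometry from one image and pose from the other can be illustrated with the classical relation for a static scene: the 3D point seen at each pixel of the first image, together with the relative camera pose of the second, determines where that pixel lands in the second image. The sketch below is not the Flow3r flow head, which operates on learned latents and also handles dynamic scenes; it is a minimal geometric analogue that assumes explicit per-pixel 3D points, a shared pinhole intrinsics matrix, and a static scene. The function names `backproject` and `induced_flow` are hypothetical.

```python
import numpy as np


def backproject(depth, K):
    """Lift a depth map (H, W) to per-pixel 3D points (H, W, 3) in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    # Homogeneous pixel coordinates mapped through K^{-1} give viewing rays with z = 1.
    rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ np.linalg.inv(K).T
    return rays * depth[..., None]


def induced_flow(points_i, K, R_ij, t_ij):
    """Flow from image i to image j for a static scene.

    points_i : (H, W, 3) 3D points in camera-i coordinates ("geometry" of image i).
    K        : (3, 3) pinhole intrinsics, assumed shared by both views.
    R_ij,t_ij: rotation (3, 3) and translation (3,) mapping camera-i coordinates
               to camera-j coordinates ("pose" of image j relative to i).
    Returns (H, W, 2) pixel displacements.
    """
    H, W, _ = points_i.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix_i = np.stack([u, v], axis=-1)

    # Move the points into camera j, then project with the pinhole model.
    points_j = points_i @ R_ij.T + t_ij
    proj = points_j @ K.T
    pix_j = proj[..., :2] / proj[..., 2:3]
    return pix_j - pix_i


# Illustrative values only: a fronto-parallel plane at depth 2 and a small sideways shift.
K = np.array([[100.0, 0.0, 8.0],
              [0.0, 100.0, 6.0],
              [0.0, 0.0, 1.0]])
points = backproject(np.full((12, 16), 2.0), K)
flow = induced_flow(points, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

With these illustrative values the flow is a uniform horizontal shift of focal length times translation divided by depth, i.e. 100 × 0.1 / 2 = 5 pixels. Because the flow depends jointly on the points and the pose, a loss on it carries gradient to both, which is the property the factorization exploits.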