Zero-Shot Monocular Scene Flow Estimation in the Wild

📅 2025-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing monocular scene flow methods suffer from poor generalization, heavy reliance on large-scale real-world annotated data, and unnatural parameterization—hindering deployment in unseen real-world environments. This paper introduces the first zero-shot monocular scene flow estimation framework for in-the-wild settings, requiring no target-domain training. The approach addresses these limitations through three key innovations: (1) a joint geometry-motion estimation mechanism; (2) a scalable synthetic data recipe yielding over one million annotated training samples across diverse scenes; and (3) a natural, effective scene flow parameterization. The resulting model achieves state-of-the-art 3D end-point error, outperforming prior methods and baselines built on large-scale models. Crucially, it demonstrates strong zero-shot generalization to casually captured DAVIS videos and RoboTAP robotic manipulation scenes, validating its practical applicability in unstructured, unseen environments.
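To make the "joint geometric-motion modeling" idea concrete: monocular scene flow is typically the per-pixel 3D displacement of points between two frames, obtained by back-projecting depth through a pinhole camera model. The sketch below is an illustrative simplification (identity pixel correspondences, made-up intrinsics), not the paper's actual architecture:

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map to 3D camera coordinates
    using a standard pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3) points

# Toy example: the scene uniformly recedes by 0.1 units in depth.
# Real methods also estimate 2D motion to match pixels across
# frames; here correspondences are assumed to be identity.
depth_t0 = np.ones((4, 4))
depth_t1 = np.full((4, 4), 1.1)
p0 = unproject(depth_t0, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
p1 = unproject(depth_t1, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
scene_flow = p1 - p0  # (H, W, 3) per-pixel 3D motion vectors
```

This also shows why geometry and motion are coupled: any error in the estimated depth maps propagates directly into the 3D flow vectors, which motivates estimating both jointly rather than in isolation.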

📝 Abstract
Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for each. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in-the-wild.
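The evaluation metric named in the abstract, 3D end-point error (EPE), is the mean Euclidean distance between predicted and ground-truth 3D flow vectors. A minimal sketch (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def epe_3d(pred, gt):
    """Mean 3D end-point error: average Euclidean distance between
    predicted and ground-truth per-point 3D flow vectors."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy check: predicting zero motion against ground-truth vectors
# of length 0.5 gives an EPE of exactly 0.5.
pred = np.zeros((100, 3))
gt = np.full((100, 3), [0.3, 0.0, 0.4])  # each vector has norm 0.5
print(epe_3d(pred, gt))  # → 0.5
```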
Problem

Research questions and friction points this paper is trying to address.

Large-scale models
Scene flow prediction
Adaptability to unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Geometry and Motion Estimation
Large-scale Synthetic Scene Flow Dataset
Scene Flow Parameterization