🤖 AI Summary
Real-world optical flow estimation suffers from significant domain shift due to reliance on animation-synthesized training data, limiting model generalization. To address this, we propose the first unsupervised framework that autonomously generates high-quality optical flow training data from single-view images—requiring neither video sequences nor manual annotations. Our method introduces three key innovations: (1) an object-agnostic volumetric rendering mechanism that ensures geometrically consistent scene motion modeling; (2) a depth-aware image inpainting module that improves flow accuracy at motion boundaries and occluded regions; and (3) the FA-Flow dataset—a large-scale, synthetic yet realistic optical flow benchmark built upon our framework. Experiments demonstrate state-of-the-art performance across multiple optical flow benchmarks, surpassing both existing unsupervised and synthetic-supervised methods. Moreover, models trained on FA-Flow significantly enhance downstream tasks, including video frame interpolation and action recognition.
📝 Abstract
Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for many video tasks. However, real-world robustness is limited by training on animated synthetic datasets. This introduces domain gaps when models are applied to real-world scenes and limits the benefits of scaling up datasets. To address these challenges, we propose **Flow-Anything**, a large-scale data generation framework designed to learn optical flow estimation from any single-view image in the real world. We employ two effective steps to make data scaling-up promising. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks, which allows us to render optical flow and novel-view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model dynamic objects in the 3D representation. These two steps allow us to generate realistic training data from large-scale single-view images, namely the **FA-Flow Dataset**. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and methods supervised on synthetic datasets. Moreover, our models serve as foundation models and enhance the performance of various downstream video tasks.
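The static-scene core of the first step (rendering ground-truth optical flow from an estimated depth map under a virtual camera move) can be sketched as below. This is a minimal illustration under standard pinhole-camera assumptions; the function name, signature, and parameters are hypothetical, not the paper's actual implementation, and it omits the dynamic-object and inpainting modules.

```python
import numpy as np

def flow_from_depth(depth, K, R, t):
    """Illustrative sketch: optical flow induced by a virtual camera move.

    Given a per-pixel depth map `depth` (H, W), intrinsics `K` (3, 3), and a
    virtual camera rotation `R` (3, 3) / translation `t` (3,), back-project
    each pixel to a 3D point, re-project it into the new view, and take the
    pixel displacement as the optical flow (2, H, W).
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous pixel coordinates (3, H*W).
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    # Back-project pixels to 3D points in the source camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Apply the virtual camera motion and re-project into the new view.
    proj = K @ (R @ pts + t.reshape(3, 1))
    proj = proj[:2] / proj[2:3]
    # Flow = displacement between re-projected and original pixel positions.
    return (proj - pix[:2]).reshape(2, h, w)
```

For a fronto-parallel plane at constant depth z and a pure sideways translation t_x, this reduces to a uniform horizontal flow of f_x * t_x / z pixels, which is a quick sanity check for the geometry.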