AI Summary
This work addresses the limitations of unsupervised optical flow estimation, which suffers from accuracy constraints due to its reliance on brightness constancy and smoothness assumptions, as well as the absence of ground-truth annotations. To overcome these challenges, we propose a generative unsupervised learning framework that leverages a pretrained depth estimation network to produce pseudo optical flow as conditional input, guiding an image generation model to synthesize high-fidelity, pixel-wise aligned frame–flow data pairs for self-supervised training. Furthermore, we introduce an inconsistent-pixel filtering mechanism to enhance model robustness in real-world scenarios. Our approach is the first to utilize generative models to construct precisely aligned optical flow supervision signals, achieving performance on par with or superior to existing unsupervised and semi-supervised methods on the KITTI 2012, KITTI 2015, and Sintel benchmarks.
Abstract
Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce \textbf{\modelname}, a novel framework that synthesizes large-scale, perfectly aligned frame--flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an \textit{inconsistent pixel filtering} strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI 2012, KITTI 2015, and Sintel demonstrate that \textbf{\modelname} achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.
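As a rough illustration of the inconsistent-pixel filtering idea, one plausible reading is a photometric consistency check: warp the generated next frame back to the first frame using the pseudo flow, and mask out pixels whose photometric error exceeds a threshold so they do not contribute to the training loss. The following sketch uses nearest-neighbour warping for brevity; all function names, the error metric, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def backward_warp(frame2, flow):
    """Sample frame2 at positions displaced by the flow (nearest-neighbour for brevity)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Coordinates in frame2 that each pixel of frame1 maps to under the flow.
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame2[yt, xt]

def consistency_mask(frame1, gen_frame2, pseudo_flow, thresh=0.05):
    """Keep only pixels whose photometric error under the pseudo flow is small.

    Returns a boolean mask: True marks pixels treated as reliable supervision.
    """
    warped = backward_warp(gen_frame2, pseudo_flow)
    err = np.abs(warped - frame1).mean(axis=-1)  # per-pixel photometric error
    return err < thresh
```

In practice such a mask would simply multiply the per-pixel supervised flow loss, so that pixels where the generated frame disagrees with the pseudo flow are excluded from fine-tuning.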