🤖 AI Summary
This work addresses the challenge of predicting language-conditioned, spatially dense future optical flow from noisy, unstructured real-world videos, in support of robotic control and video generation. To this end, the authors propose FOFPred, a model that combines a vision-language model with a diffusion framework in a single unified architecture. The model is pretrained on large-scale web-collected video-text data, with careful data curation and alignment to improve cross-domain generalization. Experimental results demonstrate that FOFPred significantly outperforms existing approaches in both language-guided robotic manipulation and video generation tasks, validating its effectiveness and versatility in multimodal reasoning and pixel-level motion prediction.
📝 Abstract
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we rely on crucial data preprocessing techniques and on our unified architecture with strong image pretraining. The trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.