FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model

📅 2024-12-11
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing world model frameworks for general-purpose robotic manipulation lack scalability across diverse objects, robot configurations, and task types. Method: This paper proposes a vision-flow-based world model planning framework that takes a language instruction and an initial image as input, using dense optical flow as a unified action representation to jointly model long-horizon visual dynamics and semantic intent. The approach integrates multimodal optical flow generation, flow-conditioned video synthesis, and vision-language joint representation learning, with internal planning conducted via reward-maximizing search. Contribution/Results: To our knowledge, this is the first framework enabling generalizable world modeling and interactive reasoning across heterogeneous manipulation tasks. Experiments demonstrate significant improvements in success rate and physical plausibility of long-horizon video plans across multiple benchmarks. Moreover, the learned world model effectively supports training of downstream low-level control policies, validating its efficacy and strong generalization capability.
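
As a rough illustration of how the three modules described above could interact during planning, the sketch below implements a beam search that proposes flow actions, rolls out a flow-conditioned next frame, scores it with a vision-language value, and keeps the plans with the highest discounted return. All function names, shapes, and the random stubs are illustrative assumptions, not the authors' released API; in the real system the stubs would be replaced by the trained flow generation, video generation, and representation models.

```python
# Minimal sketch of FLIP-style flow-centric planning, assuming three learned
# modules exposed as callables (names and signatures are hypothetical):
# propose_flows (action proposal), rollout_video (dynamics), value_fn (value).
# The stubs return random outputs so the loop runs end to end.
import numpy as np

rng = np.random.default_rng(0)

def propose_flows(image, instruction, num_candidates=4):
    """Hypothetical action-proposal module: dense image flows per candidate."""
    h, w, _ = image.shape
    return rng.normal(size=(num_candidates, h, w, 2))  # (N, H, W, 2) flow fields

def rollout_video(image, flow):
    """Hypothetical dynamics module: next frame conditioned on image + flow."""
    return np.clip(image + 0.01 * flow[..., :1], 0.0, 1.0)

def value_fn(image, instruction):
    """Hypothetical vision-language value module: task progress in [0, 1]."""
    return float(rng.uniform())

def plan(image, instruction, horizon=8, beam=3, gamma=0.99):
    """Beam search over flow actions, keeping plans with the highest discounted return."""
    beams = [([], [image], 0.0)]  # (flow plan, frame plan, discounted return)
    for t in range(horizon):
        candidates = []
        for flows, frames, ret in beams:
            for flow in propose_flows(frames[-1], instruction):
                next_frame = rollout_video(frames[-1], flow)
                reward = value_fn(next_frame, instruction)
                candidates.append((flows + [flow], frames + [next_frame],
                                   ret + (gamma ** t) * reward))
        beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:beam]
    return beams[0]  # best long-horizon flow + video plan

if __name__ == "__main__":
    init_image = rng.uniform(size=(64, 64, 3))
    flow_plan, video_plan, ret = plan(init_image, "open the drawer")
    print(len(flow_plan), "flow steps, discounted return =", round(ret, 3))
```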

📝 Abstract
We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-centric generative Planning (FLIP), a model-based planning algorithm on visual space that features three key modules: 1. a multi-modal flow generation model as the general-purpose action proposal module; 2. a flow-conditioned video generation model as the dynamics module; and 3. a vision-language representation learning model as the value module. Given an initial image and language instruction as the goal, FLIP can progressively search for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP can improve both the success rates and quality of long-horizon video plan synthesis and has the interactive world model property, opening up wider applications for future works. Video demos are on our website: https://nus-lins-lab.github.io/flipweb/.
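
The abstract's final claim, that synthesized flow and video plans can guide low-level control policies, can be pictured as a flow-conditioned imitation objective. The sketch below is a minimal, hypothetical version: a small policy network consumes an observation embedding together with the planned flow for the current step and is trained to match demonstrated actions. The network, shapes, and loss are illustrative assumptions rather than the paper's exact training recipe.

```python
# Minimal sketch of using synthesized flow plans to guide a low-level policy.
# Assumes the plan supplies a per-step flow feature the policy consumes as a
# subgoal; all names, shapes, and the imitation objective are assumptions.
import torch
import torch.nn as nn

class FlowConditionedPolicy(nn.Module):
    """Maps (observation features, planned flow step) to a robot action."""
    def __init__(self, obs_dim=128, flow_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + flow_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, flow_step):
        return self.net(torch.cat([obs, flow_step], dim=-1))

# Toy imitation-learning loop on random tensors standing in for a demo dataset.
policy = FlowConditionedPolicy()
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
obs = torch.randn(256, 128)          # encoded observations
flow = torch.randn(256, 64)          # pooled planned flow per step
expert_action = torch.randn(256, 7)  # demonstrated low-level actions

for epoch in range(5):
    pred = policy(obs, flow)
    loss = nn.functional.mse_loss(pred, expert_action)
    optim.zero_grad()
    loss.backward()
    optim.step()
print(f"final imitation loss: {loss.item():.4f}")
```
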
Problem

Research questions and friction points this paper is trying to address.

Developing scalable model-based planning for manipulation tasks
Generating long-horizon plans using vision and language inputs
Improving robot control through synthesized flow and video plans
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal flow generation model
Flow-conditioned video generation model
Vision-language representation learning model
Authors

Chongkai Gao, National University of Singapore
Haozhuo Zhang, Peking University
Zhixuan Xu, National University of Singapore
Zhehao Cai, National University of Singapore
Lin Shao, National University of Singapore