Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Dual-arm cooperative manipulation poses significant challenges for generalization from single-arm vision-language-action (VLA) models due to high-dimensional action spaces, intricate inter-arm coordination requirements, and scarcity of real-world demonstration data. To address this, we propose a novel optical-flow-guided text-to-video generation paradigm, introducing the first “text → optical flow → video” two-stage decomposition architecture. Optical flow serves as a differentiable, motion-explicit intermediate representation that decouples language intent understanding from physical motion modeling, thereby substantially improving action-semantic alignment accuracy. Crucially, our method eliminates reliance on large-scale dual-arm demonstration datasets; instead, it achieves effective fine-tuning using only a small number of simulated or real-robot trajectories. Integrating diffusion-based policy networks, optical flow prediction, and text-to-video generation, our approach is rigorously validated on both simulation and real-world dual-arm robotic platforms, demonstrating strong generalization capability, high inter-arm coordination fidelity, and exceptional data efficiency.

Technology Category

Application Category

📝 Abstract

Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.

Problem

Research questions and friction points this paper is trying to address.

Learning generalizable bimanual manipulation policies for embodied agents

Overcoming limitations of single-arm datasets and VLA models

Reducing robot-data requirements via flow-based video prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes text-to-video models for robot trajectory prediction

Uses optical flow as intermediate movement representation

Trains lightweight diffusion policy for action generation

🔎 Similar Papers

Vision-based Manipulation from Single Human Video with Open-World Object Graphs