Generalist Bimanual Manipulation via Foundation Video Diffusion Models

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dual-arm robotic manipulation faces two critical bottlenecks, data scarcity and embodiment heterogeneity, which severely limit cross-platform and cross-task generalization. To address this, we propose VIDAR, a two-stage framework. Stage I pre-trains a video diffusion model on large-scale multi-view robot videos, unifying heterogeneous observation spaces through an encoding of robot, camera, task, and scene context. Stage II applies a masked inverse dynamics model that extracts action-relevant information from the generated trajectories without pixel-level annotations. Evaluated on unseen tasks, backgrounds, and robotic platforms, VIDAR surpasses state-of-the-art methods using only 20 minutes of human demonstrations (about 1% of typical data requirements), and shows strong robustness in semantic understanding, transfer to unseen backgrounds, and few-shot generalization.
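
For a concrete picture of the pipeline, here is a minimal sketch of the two-stage inference loop in PyTorch. The class names (VideoDiffusionStage, MaskedInverseDynamicsStage), the 16-frame horizon, and the 14-dimensional action space are illustrative assumptions, not the authors' released interfaces; Stage I is reduced to a placeholder that only returns frames of the right shape.

```python
# Minimal sketch of the two-stage inference loop (Stage I: video diffusion,
# Stage II: masked inverse dynamics). Class names, shapes, the 16-frame
# horizon, and the 14-D action space are illustrative assumptions.
import torch
import torch.nn as nn


class VideoDiffusionStage(nn.Module):
    """Stage I: generate future multi-view frames conditioned on a unified
    observation prompt (robot, camera, task, scene context)."""

    def generate(self, obs_frames, prompt, horizon=16):
        # A real model would run an iterative denoising loop here; this
        # placeholder just returns zero frames of the right shape.
        b, v, _, c, h, w = obs_frames.shape  # (batch, views, time, C, H, W)
        return torch.zeros(b, v, horizon, c, h, w)


class MaskedInverseDynamicsStage(nn.Module):
    """Stage II: decode generated frames into one action per future step."""

    def __init__(self, action_dim=14):  # e.g. 2 arms x 7 DoF (assumption)
        super().__init__()
        self.head = nn.LazyLinear(action_dim)

    def forward(self, frames):
        # Learned masking is omitted here; see the training sketch after the abstract.
        pooled = frames.mean(dim=1)          # average over camera views
        feats = pooled.flatten(start_dim=2)  # (batch, time, C*H*W)
        return self.head(feats)              # (batch, time, action_dim)


def vidar_style_step(diffusion, idm, obs_frames, prompt):
    """Generate a short future video, then translate it into actions."""
    future = diffusion.generate(obs_frames, prompt)
    return idm(future)


# Usage on dummy data: 1 episode, 2 views, 2 context frames of 64x64 RGB.
obs = torch.rand(1, 2, 2, 3, 64, 64)
actions = vidar_style_step(VideoDiffusionStage(), MaskedInverseDynamicsStage(),
                           obs, prompt="fold the towel with both arms")
print(actions.shape)  # torch.Size([1, 16, 14])
```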

📝 Abstract
Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.
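
The abstract's claim that the masks are learned "without requiring pixel-level labels" can be made concrete with a small training sketch: the soft mask below is supervised only indirectly, through an action-regression loss. The network sizes, the sigmoid mask head, and the MSE objective are assumptions for illustration rather than the paper's exact architecture.

```python
# Minimal training sketch for a masked inverse dynamics model: the soft mask
# receives no pixel-level supervision and is learned only through the action
# loss. Layer sizes, the sigmoid mask head, and the MSE loss are assumptions.
import torch
import torch.nn as nn


class MaskedIDM(nn.Module):
    def __init__(self, action_dim=14, hidden=64):
        super().__init__()
        # Predicts a per-pixel soft mask from a concatenated frame pair.
        self.mask_net = nn.Sequential(
            nn.Conv2d(6, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1), nn.Sigmoid(),
        )
        # Regresses the action from the masked frame pair.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, hidden, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        pair = torch.cat([frame_t, frame_t1], dim=1)  # (B, 6, H, W)
        mask = self.mask_net(pair)                    # (B, 1, H, W), values in [0, 1]
        return self.encoder(pair * mask), mask


# One training step on dummy data: only the action label supervises the model.
model = MaskedIDM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame_t, frame_t1 = torch.rand(8, 3, 128, 128), torch.rand(8, 3, 128, 128)
action = torch.rand(8, 14)

opt.zero_grad()
pred, mask = model(frame_t, frame_t1)
loss = nn.functional.mse_loss(pred, action)
loss.backward()
opt.step()
```

Because nothing supervises the mask directly, it is free to down-weight background pixels that do not help predict the action, which is one plausible reading of why the learned masks generalize to unseen backgrounds.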
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in bimanual robotic manipulation
Overcoming embodiment heterogeneity in robotic control
Enhancing generalization to unseen tasks and backgrounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages diffusion-based video pre-training
Uses a masked inverse dynamics model for action prediction
Encodes robot, camera, task, and scene context in a unified observation space (a sketch follows this list)
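
As referenced above, here is a hypothetical sketch of what a unified observation encoding could look like: robot, camera, task, and scene context rendered into a single conditioning string for the video model. The ObservationContext fields and the prompt template are assumptions; the abstract does not specify the exact format.

```python
# Hypothetical sketch of a unified observation encoding: heterogeneous robot,
# camera, task, and scene context flattened into one conditioning string for
# the video diffusion model. Field names and the template are assumptions.
from dataclasses import dataclass


@dataclass
class ObservationContext:
    robot: str      # platform identifier, e.g. "dual-arm platform A"
    cameras: tuple  # view names, e.g. ("front", "left_wrist", "right_wrist")
    task: str       # natural-language instruction
    scene: str      # coarse scene description


def encode_observation(ctx: ObservationContext) -> str:
    """Flatten the context into a single prompt the diffusion model conditions on."""
    views = ", ".join(ctx.cameras)
    return (f"robot: {ctx.robot}; cameras: {views}; "
            f"task: {ctx.task}; scene: {ctx.scene}")


prompt = encode_observation(ObservationContext(
    robot="dual-arm platform A",
    cameras=("front", "left_wrist", "right_wrist"),
    task="stack the red cup on the blue cup",
    scene="cluttered tabletop",
))
print(prompt)
```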
Authors

Yao Feng
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University

Hengkai Tan
Tsinghua University
Reinforcement Learning, Robot Learning, Embodied AI, Deep Generative Models

Xinyi Mao
Undergraduate, Tsinghua University
Robotics, Embodied AI

Guodong Liu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University

Shuhe Huang
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University

Chendong Xiang
First-year PhD student of Computer Science and Technology, Tsinghua University
Generative Models, Embodied AI

Hang Su
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University

Jun Zhu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University