Generalist Bimanual Manipulation via Foundation Video Diffusion Models

📅 2025-07-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Bimanual robotic manipulation faces two bottlenecks that limit generalization across platforms and tasks: data scarcity and embodiment heterogeneity. To address them, we propose VIDAR (VIdeo Diffusion for Action Reasoning), a two-stage framework. The first stage pre-trains a video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, using a unified observation space that encodes robot, camera, task, and scene contexts. The second stage is a masked inverse dynamics model that learns masks to extract action-relevant information from the generated trajectories without pixel-level labels, and these masks generalize to unseen backgrounds. With only 20 minutes of human demonstrations on an unseen robot platform (about 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding and surpasses state-of-the-art methods.

📝 Abstract

Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.
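The masked inverse dynamics idea can be illustrated with a toy forward pass and loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-pixel mask function, the linear action head, the sparsity penalty, and every name below (soft_mask, predict_action, masked_idm_loss, sparsity) are hypothetical and chosen only to show how a mask can be learned from action supervision alone, without pixel-level labels.

```python
import math
from typing import List, Sequence

Frame = List[List[float]]  # toy grayscale frame, height x width


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(frame: Frame, w: float, b: float) -> Frame:
    """Hypothetical mask network: one weight and bias shared by all pixels."""
    return [[sigmoid(w * px + b) for px in row] for row in frame]


def apply_mask(frame: Frame, mask: Frame) -> Frame:
    """Element-wise product, so low-mask pixels contribute little downstream."""
    return [[px * m for px, m in zip(f_row, m_row)]
            for f_row, m_row in zip(frame, mask)]


def predict_action(masked: Frame, head: Sequence[Frame]) -> List[float]:
    """Hypothetical linear action head: one weight map per action dimension."""
    return [
        sum(px * wt
            for row, w_row in zip(masked, w_map)
            for px, wt in zip(row, w_row))
        for w_map in head
    ]


def masked_idm_loss(frame: Frame, target: Sequence[float], w: float, b: float,
                    head: Sequence[Frame], sparsity: float = 0.01) -> float:
    """Action regression error plus an assumed penalty on mean mask value.

    Only the action label supervises the mask; the penalty (an assumption of
    this sketch) pushes the mask to keep as few pixels as possible.
    """
    mask = soft_mask(frame, w, b)
    pred = predict_action(apply_mask(frame, mask), head)
    action_err = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    n_pixels = sum(len(row) for row in mask)
    mask_l1 = sum(m for row in mask for m in row) / n_pixels
    return action_err + sparsity * mask_l1
```

In a pipeline like the one the abstract describes, the frames passed to such a model at test time would come from the video diffusion model's generated trajectory, and the predicted actions would be sent to the robot.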

Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in bimanual robotic manipulation

Overcoming embodiment heterogeneity in robotic control

Enhancing generalization to unseen tasks and backgrounds

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages diffusion-based video pre-training

Uses a masked inverse dynamics model

Unified observation space encoding
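One way to picture a unified observation space is as a single record that carries the robot, camera, task, and scene contexts named in the abstract and renders them into one conditioning string. The sketch below is an assumption for illustration only: the class ObservationContext, its field names, the to_prompt method, and the text layout are all hypothetical, and the paper's actual encoding may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObservationContext:
    """Hypothetical container for the four contexts listed in the abstract."""
    robot: str                 # which bimanual platform produced the video
    cameras: Tuple[str, ...]   # names of the views, in a fixed order
    task: str                  # natural-language task instruction
    scene: str                 # short description of the environment

    def to_prompt(self) -> str:
        """Render all contexts into one string in a platform-independent format."""
        views = ", ".join(self.cameras)
        return (f"Robot: {self.robot}. Cameras: {views}. "
                f"Scene: {self.scene}. Task: {self.task}.")


# Placeholder values, not taken from the paper.
ctx = ObservationContext(
    robot="platform A",
    cameras=("head", "left wrist", "right wrist"),
    task="stack the bowls",
    scene="kitchen table",
)
prompt = ctx.to_prompt()
```

Because every platform fills in the same four fields, a single video model can in principle be conditioned on data from different robots without per-robot input formats.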

🔎 Similar Papers

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos