CRAFT: Video Diffusion for Bimanual Robot Data Generation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of dual-arm robotic policies learned from real-world demonstrations, which stems from high data collection costs and insufficient visual diversity. To overcome these challenges, the authors propose CRAFT, a framework that leverages video diffusion models to generate realistic dual-arm manipulation sequences. CRAFT conditions a pretrained video diffusion Transformer on Canny edge structures extracted from simulated trajectories, together with action labels, to produce temporally coherent, physically plausible, and visually diverse videos. Notably, it enables Sim2Real data augmentation without requiring real-robot replay, and a single pipeline supports viewpoint, illumination, and background changes, cross-embodiment transfer, and multi-view synthesis. Experiments show that CRAFT significantly improves policy success rates on both simulated and real dual-arm tasks, outperforming existing data augmentation approaches.
📝 Abstract
Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: https://craftaug.github.io/
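The abstract describes reducing each simulated frame to an edge map that preserves scene structure while discarding appearance, so the diffusion model can re-render textures, lighting, and backgrounds without disturbing the robot and object geometry that the action labels depend on. The sketch below is a hypothetical stand-in for that extraction step, not the authors' code: a real pipeline would use Canny edge detection (e.g. `cv2.Canny`) on each video frame, whereas this stdlib-only version thresholds a Sobel gradient magnitude to the same illustrative effect.

```python
# Hypothetical sketch of the edge-based conditioning input described in the
# abstract. The actual CRAFT system conditions a pretrained video diffusion
# Transformer on Canny edges; a Sobel gradient-magnitude threshold is used
# here as a stand-in so the example needs no external dependencies.

def edge_map(frame, threshold=100.0):
    """frame: 2D list of grayscale values in [0, 255].
    Returns a same-sized binary map, 1 where an edge is detected."""
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sobel filters approximate horizontal/vertical intensity change.
            gx = (frame[y-1][x+1] + 2*frame[y][x+1] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y][x-1] - frame[y+1][x-1])
            gy = (frame[y+1][x-1] + 2*frame[y+1][x] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y-1][x] - frame[y-1][x+1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges

# A synthetic 6x6 frame with a vertical intensity step: the edge map fires
# along the boundary columns and stays zero in the flat regions, which is
# exactly the structure-only signal the diffusion model would be given.
frame = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
edges = edge_map(frame)
```

Per frame, this edge map would be stacked with the simulator's action labels as conditioning, so the generated video varies in appearance while staying action-consistent.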
Problem

Research questions and friction points this paper is trying to address.

bimanual robot learning, demonstration data scarcity, visual diversity, policy robustness, Sim2Real
Innovation

Methods, ideas, or system contributions that make the work stand out.

video diffusion, bimanual manipulation, data augmentation, Sim2Real, action-consistent generation
Jason Chen — Thomas Lord Department of Computer Science, University of Southern California
I-Chun Arthur Liu — Thomas Lord Department of Computer Science, University of Southern California
Gaurav Sukhatme — Professor, Departments of CS and ECE, USC (Robotics, Artificial Intelligence, Robot Networks, Motion Planning, Machine Learning)
Daniel Seita — University of Southern California (Robotics, Machine Learning)