🤖 AI Summary
This work addresses the limited generalization of dual-arm robotic policies learned from real-world demonstrations, which stems from high data collection costs and insufficient visual diversity. To overcome these challenges, the authors propose CRAFT, a novel framework that leverages video diffusion models to generate realistic dual-arm manipulation sequences. CRAFT conditions a pretrained video diffusion Transformer on Canny edge structures extracted from simulated trajectories, paired with the corresponding action labels, to produce temporally coherent, physically plausible, and visually diverse videos. Notably, it enables Sim2Real data augmentation without requiring real-robot replay, and it supports viewpoint, illumination, and background variation, embodiment transfer, and multi-view synthesis within a single unified pipeline. Experiments demonstrate that CRAFT significantly improves policy success rates on both simulated and real dual-arm tasks, outperforming existing data augmentation approaches.
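The structural conditioning step can be pictured as extracting a per-frame edge map from each simulated rollout frame and feeding it to the diffusion model alongside the trajectory's action labels. The sketch below is a simplified, dependency-free stand-in for that step: the paper uses the Canny detector, while here a plain Sobel-gradient threshold illustrates the idea, and all names (`extract_edge_map`, the toy frame) are illustrative assumptions, not from the paper.

```python
# Simplified stand-in for CRAFT's structural conditioning: per-frame edge maps
# are extracted from simulated frames and serve as conditioning input for the
# video diffusion model. The paper uses Canny edges; a Sobel-magnitude
# threshold is used here so the example stays dependency-free.

def extract_edge_map(frame, threshold=2.0):
    """Return a binary edge map (0/1 per pixel) for a grayscale frame."""
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sobel horizontal and vertical gradients.
            gx = (frame[y-1][x+1] + 2*frame[y][x+1] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y][x-1] - frame[y+1][x-1])
            gy = (frame[y+1][x-1] + 2*frame[y+1][x] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y-1][x] - frame[y-1][x+1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges

# Toy "simulated frame": dark background with one bright square (an object).
frame = [[0.0] * 8 for _ in range(8)]
for y in range(3, 6):
    for x in range(3, 6):
        frame[y][x] = 1.0

edge_map = extract_edge_map(frame)
# The map fires on the square's boundary but not its uniform interior,
# capturing object and arm silhouettes while discarding texture and lighting.
```

Discarding appearance while keeping geometry is what lets the diffusion model freely revary lighting, background, and viewpoint without breaking the physical structure of the demonstrated motion.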
📝 Abstract
Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos paired with action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, together with the action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation. Our project website is available at: https://craftaug.github.io/