This&That: Language-Gesture Controlled Video Generation for Robot Planning

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 11
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of ambiguous natural-language instructions, weak controllability of visual planning, and visual–action mapping distortion in complex, uncertain environments, this paper proposes a language–gesture jointly driven video generation and execution framework. Methodologically: (1) we introduce the first language–gesture co-conditioned video diffusion model, enabling fine-grained, temporally aligned, and controllable visual plan generation; (2) we design Diffusion Video to Action (DiVA), a behavior cloning architecture that directly maps generated videos to high-fidelity robot actions. Our approach integrates multimodal conditional modeling, gesture–semantic alignment, and diffusion-based video generation. Evaluated on a multi-task robotic planning benchmark, our method significantly outperforms existing video-based planning and behavior cloning approaches—improving instruction understanding robustness by 23.6% and action execution accuracy by 19.4%.

Technology Category

Application Category

📝 Abstract
Clear, interpretable instructions are invaluable when attempting any complex task. Good instructions help to clarify the task and even anticipate the steps needed to solve it. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intent, and 3) translating visual plans into robot actions. This&That uses language-gesture conditioning to generate video predictions, as a succinct and unambiguous alternative to existing language-only methods, especially in complex and uncertain environments. These video predictions are then fed into a behavior cloning architecture dubbed Diffusion Video to Action (DiVA), which outperforms prior state-of-the-art behavior cloning and video-based planning methods by substantial margins.
Problem

Research questions and friction points this paper is trying to address.

Enabling unambiguous task communication via language-gesture instructions
Controlling video generation to align with user intent
Translating visual plans into actionable robot behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-gesture controlled video generation
Diffusion Video to Action (DiVA) architecture
Video generative models for planning
🔎 Similar Papers
No similar papers found.