VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 23
Influential: 0
🤖 AI Summary
This work addresses the cross-modal translation challenge from human demonstration videos to robotic task planning. We propose SeeDo, an end-to-end "watch video → generate plan" paradigm. Methodologically, it integrates keyframe selection, multi-frame visual encoding, and instruction-tuned vision-language models (e.g., LLaVA-1.5) within a unified inference pipeline to directly produce structured, executable task plans. Our key contribution is the first demonstration of semantic alignment from long-horizon, real-world manipulation videos to robot action sequences, without manual annotations or intermediate representations. Evaluated on three categories of realistic pick-and-place tasks, SeeDo significantly outperforms video-input VLM baselines in plan quality. Generated plans successfully drive both simulated and physical robotic arms, achieving a 32% improvement in task success rate.

📝 Abstract
Vision Language Models (VLMs) have recently been adopted in robotics for their capability in common sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion planning from natural language instructions and simulate training data for robot learning. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do". To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo's superior performance. We further deployed the generated task plans in both a simulation environment and on a real robot arm.
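The pipeline described in the abstract (keyframe selection → visual perception → VLM reasoning → task plan) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the frame-difference keyframe heuristic, the function names, and the stubbed-out perception and VLM steps are all assumptions introduced here for clarity.

```python
# Hedged sketch of a SeeDo-style "watch video -> plan" pipeline.
# All names and heuristics below are illustrative assumptions, not the
# paper's actual method; the VLM call is stubbed out.

def select_keyframes(frames, threshold=30.0):
    """Keep frames whose mean absolute pixel change from the last kept
    frame exceeds a threshold (a simple keyframe-selection heuristic)."""
    if not frames:
        return []
    keyframes = [frames[0]]
    for frame in frames[1:]:
        last = keyframes[-1]
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff > threshold:
            keyframes.append(frame)
    return keyframes

def describe_frame(frame):
    """Placeholder for visual perception (object detection / captioning)."""
    return f"frame summary: mean intensity {sum(frame) / len(frame):.1f}"

def plan_from_video(frames):
    """Assemble keyframe descriptions into a prompt and return a
    structured plan. A real system would send the keyframes and prompt
    to a vision-language model and parse its reply."""
    keyframes = select_keyframes(frames)
    context = "\n".join(describe_frame(f) for f in keyframes)
    # Stub: pretend the VLM inferred a pick-and-place sequence from `context`.
    return [{"action": "pick", "object": "block"},
            {"action": "place", "target": "bowl"}]

# Toy "video": each frame is a flat list of pixel intensities.
video = [[0] * 16, [0] * 16, [100] * 16, [100] * 16, [200] * 16]
print(len(select_keyframes(video)))  # 3 keyframes (intensities 0, 100, 200)
print(plan_from_video(video))
```

The keyframe step matters because long-horizon videos contain mostly redundant frames; reducing the video to a handful of state-change frames keeps the VLM's visual context short enough to reason over.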
Problem

Research questions and friction points this paper is trying to address.

Interpreting human demonstration videos for robot task planning
Generating robot action plans from visual human demonstrations
Translating human demo videos into executable robot instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interprets human demo videos via VLM
Generates robot task planning from videos
Integrates keyframe selection and visual perception
Beichen Wang
PhD Candidate at Wageningen University & Research
Natural Language Processing, Information Retrieval, Complex Network
Juexiao Zhang
CS PhD student at New York University
Machine Learning, Computer Vision, Robotics
Shuwen Dong
New York University, New York, NY 11201, USA
Irving Fang
New York University, New York, NY 11201, USA
Chen Feng
New York University, New York, NY 11201, USA