Human Demonstrations are Generalizable Knowledge for Robots

📅 2023-12-05
🏛️ arXiv.org
📈 Citations: 6 (influential: 0)
🤖 AI Summary
Existing imitation learning approaches decompose human demonstration videos into raw action sequences, limiting cross-task and cross-object generalization. Method: We propose a hierarchical knowledge distillation framework that extracts three levels of generalizable knowledge from videos: low-level observational representations, mid-level action structures, and high-level task–object patterns. We further design a knowledge-retrieval-augmented LLM-based planner integrated with a closed-loop policy execution module, enabling knowledge-aware reasoning and feedback-driven correction. Contribution/Results: This work introduces the first method to elevate human demonstrations into structured, transferable, general-purpose knowledge. In real-robot experiments across multiple tasks, our approach achieves substantial improvements in cross-instance generalization success rates using only a few demonstrations—effectively overcoming a fundamental generalization bottleneck in imitation learning.
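The three knowledge levels named in the summary map naturally onto a small data model. Below is a minimal sketch of that hierarchy, assuming a relation-string representation the page does not spell out; all class and field names are hypothetical, since no reference implementation is published here.

```python
# Hypothetical data model for the three-level knowledge hierarchy.
# Names and fields are illustrative assumptions, not DigKnow's actual code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObservationKnowledge:
    """Low-level: per-frame scene facts extracted from the demonstration video."""
    frame_index: int
    objects: List[str]             # e.g. ["mug", "table"]
    spatial_relations: List[str]   # e.g. ["mug on table"]

@dataclass
class ActionKnowledge:
    """Mid-level: an action inferred from changes between observations."""
    verb: str                      # e.g. "pick_up"
    target: str                    # e.g. "mug"
    preconditions: List[str]       # relations that must hold before the action
    effects: List[str]             # relations expected to hold afterwards

@dataclass
class PatternKnowledge:
    """High-level: task- and object-level regularities distilled from actions."""
    task: str                      # e.g. "tidy the table"
    object_category: str           # e.g. "container"
    action_sequence: List[ActionKnowledge] = field(default_factory=list)
```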
📝 Abstract
Learning from human demonstrations is an emerging trend for designing intelligent robotic systems. However, previous methods typically regard videos as instructions, simply dividing them into action sequences for robotic repetition, which poses obstacles to generalization to diverse tasks or object instances. In this paper, we propose a different perspective, considering human demonstration videos not as mere instructions, but as a source of knowledge for robots. Motivated by this perspective and the remarkable comprehension and generalization capabilities exhibited by large language models (LLMs), we propose DigKnow, a method that DIstills Generalizable KNOWledge with a hierarchical structure. Specifically, DigKnow begins by converting human demonstration video frames into observation knowledge. This knowledge is then subjected to analysis to extract human action knowledge and further distilled into pattern knowledge encompassing task and object instances, resulting in the acquisition of generalizable knowledge with a hierarchical structure. In settings with different tasks or object instances, DigKnow retrieves relevant knowledge for the current task and object instances. Subsequently, the LLM-based planner conducts planning based on the retrieved knowledge, and the policy executes actions in line with the plan to achieve the designated task. Utilizing the retrieved knowledge, we validate and rectify planning and execution outcomes, resulting in a substantial enhancement of the success rate. Experimental results across a range of tasks and scenes demonstrate the effectiveness of this approach in facilitating real-world robots to accomplish tasks with the knowledge derived from human demonstrations.
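The pipeline the abstract describes (retrieve relevant knowledge, plan with an LLM, execute in closed loop with correction) can be made concrete with a toy, runnable sketch. Retrieval is reduced here to keyword overlap, and both the LLM planner and the robot policy are stubbed out; every name below is an illustrative assumption, not DigKnow's actual interface.

```python
# Toy sketch of the retrieve -> plan -> execute/rectify loop from the abstract.
from dataclasses import dataclass

@dataclass
class KnowledgeEntry:
    task: str
    obj: str
    steps: list  # ordered action strings, e.g. ["grasp mug", "place mug on shelf"]

def retrieve(kb: list, task: str, objects: list) -> list:
    """Rank stored knowledge by overlap with the current task and object set."""
    def score(e: KnowledgeEntry) -> int:
        return int(e.task == task) + int(e.obj in objects)
    return sorted((e for e in kb if score(e) > 0), key=score, reverse=True)

def plan_with_llm(task: str, knowledge: list) -> list:
    """Stand-in for the LLM planner: reuse the best-matching step sequence."""
    return knowledge[0].steps if knowledge else []

def robot_step(step: str) -> bool:
    """Mock policy call; a real system would check post-conditions here."""
    print(f"executing: {step}")
    return True

def execute_with_validation(plan: list) -> bool:
    """Closed-loop execution: retry a failed step once as a stand-in for
    the paper's knowledge-based rectification."""
    for step in plan:
        if not robot_step(step) and not robot_step(step):
            return False
    return True

kb = [KnowledgeEntry("tidy table", "mug", ["grasp mug", "place mug on shelf"])]
plan = plan_with_llm("tidy table", retrieve(kb, "tidy table", ["mug", "table"]))
execute_with_validation(plan)
```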
Problem

Research questions and friction points this paper is trying to address.

Generalizing robotic learning from human demonstrations across diverse tasks
Converting video frames into hierarchical knowledge for robot comprehension
Enhancing task success via LLM-based planning and knowledge retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical knowledge distillation from human videos
LLM-based planning with retrieved knowledge
Validation and rectification of execution outcomes (a sketch of the outcome check follows this list)
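The validation step reduces to checking the effects predicted by retrieved knowledge against what is actually observed after execution. A minimal sketch, assuming effects and observations are both represented as relation strings (our assumption, not the paper's stated format):

```python
# Hedged illustration of outcome validation against predicted effects.
def validate_step(expected_effects: set, observed: set) -> bool:
    """A step succeeds if every predicted effect holds in the observed scene."""
    return expected_effects <= observed

# Usage: the plan step "place mug on shelf" should yield the relation "mug on shelf".
assert validate_step({"mug on shelf"}, {"mug on shelf", "shelf on wall"})
assert not validate_step({"mug on shelf"}, {"mug on table"})
```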
👥 Authors
Guangyan Chen (Beijing Institute of Technology)
Te Cui (Beijing Institute of Technology)
Tianxing Zhou (School of Automation, Beijing Institute of Technology, Beijing, 100081, China)
Zicai Peng (School of Automation, Beijing Institute of Technology, Beijing, 100081, China)
Mengxiao Hu (School of Automation, Beijing Institute of Technology, Beijing, 100081, China)
Meiling Wang (School of Automation, Beijing Institute of Technology, Beijing, 100081, China)
Yi Yang (School of Automation, Beijing Institute of Technology, Beijing, 100081, China)
Yufeng Yue (School of Automation, Beijing Institute of Technology, Beijing, 100081, China)