SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation

📅 2025-01-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Robotic imitation learning in real-world scenarios faces challenges of requiring high precision, long-horizon execution, strong generalization, and yet extremely limited demonstrations—e.g., clothing folding, tabletop rearrangement, and towel hanging. Method: We propose a few-shot imitation learning framework grounded in semantic keypoint representation. Our approach introduces the first task-agnostic, automatic semantic keypoint extraction mechanism, leveraging vision foundation models (SAM/CLIP) and unsupervised keypoint detection to achieve cross-object, cross-environment, and cross-modal (including human video) semantic alignment and behavior cloning. Contribution/Results: With only 30 demonstrations, our method achieves 70% success rate on long-horizon tasks like towel hanging; grasping success for cups and mice improves by 100%; and it demonstrates strong robustness under object deformation, environmental perturbations, and variations in human pose.
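The cross-object and cross-environment alignment described above rests on matching semantic keypoint descriptors against dense per-pixel features of a new observation. A minimal sketch of that nearest-neighbor step, assuming per-pixel features from a vision foundation model (e.g. CLIP/DINO-style) have already been computed; the function and variable names here are hypothetical illustrations, not the authors' code:

```python
import numpy as np

def match_keypoints(ref_descriptors, feature_map):
    """Locate reference keypoints in a new image by cosine similarity.

    ref_descriptors: (K, D) array, one descriptor per semantic keypoint.
    feature_map:     (H, W, D) dense per-pixel features of the new observation.
    Returns a (K, 2) array of (row, col) best-match locations.
    """
    H, W, D = feature_map.shape
    flat = feature_map.reshape(-1, D)
    # L2-normalize so the dot product equals cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    refs = ref_descriptors / (np.linalg.norm(ref_descriptors, axis=1, keepdims=True) + 1e-8)
    sims = refs @ flat.T               # (K, H*W) cosine similarities
    idx = sims.argmax(axis=1)          # best-matching pixel per keypoint
    return np.stack([idx // W, idx % W], axis=1)
```

Because the matching operates on semantic features rather than raw pixels, the same reference descriptors can transfer to unseen object instances or to human-video frames, which is what enables the cross-embodiment learning the summary mentions.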

📝 Abstract
Real-world tasks such as garment manipulation and table rearrangement demand that robots perform generalizable, highly precise, and long-horizon actions. Although imitation learning has proven to be an effective approach for teaching robots new skills, large amounts of expert demonstration data are still indispensable for these complex tasks, resulting in high sample complexity and costly data collection. To address this, we propose Semantic Keypoint Imitation Learning (SKIL), a framework that automatically obtains semantic keypoints with the help of vision foundation models and forms descriptors of these keypoints, enabling efficient imitation learning of complex robotic tasks with significantly lower sample complexity. In real-world experiments, SKIL doubles the performance of baseline methods in tasks such as picking up a cup or a mouse, while demonstrating exceptional robustness to variations in objects, environmental changes, and distractors. For long-horizon tasks like hanging a towel on a rack, where previous methods fail completely, SKIL achieves a mean success rate of 70% with as few as 30 demonstrations. Furthermore, SKIL naturally supports cross-embodiment learning due to its semantic keypoint abstraction; our experiments demonstrate that even human videos bring considerable improvement to the learning performance. All these results demonstrate the success of SKIL in achieving data-efficient, generalizable robotic learning. Visualizations and code are available at: https://skil-robotics.github.io/SKIL-robotics/.
Problem

Research questions and friction points this paper is trying to address.

Robot Learning
Few-shot Learning
Complex Task Execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Keypoint Imitation Learning
Data-efficient Learning
Adaptability