🤖 AI Summary
This work addresses the challenge of enabling robots to learn multi-step manipulation tasks from a single human demonstration—without additional model training or manual annotation. We propose an end-to-end one-shot imitation learning framework that employs a lightweight vision-to-action mapping network built upon a pre-trained visual encoder. To enhance cross-task generalization, we incorporate contrastive learning and design the architecture to support plug-and-play integration of diverse backbone models. Our key contribution is the first demonstration of high-performance, fine-tuning-free, annotation-free one-shot imitation for long-horizon multi-step tasks—overcoming dual bottlenecks in task length scalability and deployment efficiency inherent in prior methods. Experiments show average success rates of 82.5% on multi-step tasks and 90% on single-step tasks, substantially outperforming baselines while maintaining computational efficiency and architectural extensibility.
📝 Abstract
Recent advances in one-shot imitation learning have enabled robots to acquire new manipulation skills from a single human demonstration. While existing methods achieve strong performance on single-step tasks, they remain limited in their ability to handle long-horizon, multi-step tasks without additional model training or manual annotation. We propose a method that addresses this setting: given only a single demonstration, it requires neither additional model training nor manual annotation. On multi-step and single-step manipulation tasks, our method achieves average success rates of 82.5% and 90%, respectively, matching or exceeding baseline performance in both cases. We also compare the performance and computational efficiency of alternative pre-trained feature extractors within our framework.
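The architecture described above (a frozen pre-trained visual encoder feeding a lightweight vision-to-action mapping, with swappable backbones) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all class names, dimensions, and the random-projection "encoder" stand-in are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from the paper).
IMG_DIM, FEATURE_DIM, ACTION_DIM = 3 * 32 * 32, 64, 7

class FrozenEncoder:
    """Stand-in for a pre-trained visual backbone; its weights are never updated."""
    def __init__(self):
        self.W = rng.standard_normal((IMG_DIM, FEATURE_DIM)) / np.sqrt(IMG_DIM)

    def __call__(self, image: np.ndarray) -> np.ndarray:
        # A fixed nonlinear projection models "frozen pre-trained features".
        return np.tanh(image @ self.W)

class LightweightHead:
    """Small trainable mapping from visual features to robot actions."""
    def __init__(self):
        self.W = np.zeros((FEATURE_DIM, ACTION_DIM))

    def __call__(self, features: np.ndarray) -> np.ndarray:
        return features @ self.W

# Plug-and-play: any backbone producing FEATURE_DIM-dimensional features
# could be swapped in for FrozenEncoder without changing the head.
encoder, head = FrozenEncoder(), LightweightHead()
image = rng.standard_normal(IMG_DIM)   # flattened observation
action = head(encoder(image))          # predicted action vector
print(action.shape)
```

The key design point the sketch illustrates is the separation of concerns: only the small head would be adapted per task, which is what keeps the approach training-light at deployment time.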