🤖 AI Summary
To address the challenge of rapidly adapting robotic manipulation skills, this paper proposes a hand-demonstration learning method that requires neither task annotations nor teleoperated robot demonstrations. The approach introduces a hand-path-driven, two-stage cross-modal behavior retrieval mechanism: first, coarse filtering via visual tracking and appearance similarity; second, fine-grained retrieval of matching sub-trajectories from unlabeled, task-agnostic robot play data based on temporal behavioral similarity. Crucially, the method operates without camera calibration or precise hand pose estimation, significantly lowering the human demonstration burden. Only lightweight policy fine-tuning is then required, averaging under four minutes per new task. Evaluated on a real robotic platform, the method achieves over a twofold improvement in average task success rate compared to retrieval baselines, demonstrating both efficiency and practical applicability.
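The two-stage retrieval described above can be sketched in a few dozen lines. This is a hypothetical illustration, not the paper's implementation: the segment format (`embedding`, `path`), the cosine threshold, and the use of dynamic time warping as the "temporal behavioral similarity" measure are all assumptions made for the sketch.

```python
# Hypothetical sketch of two-stage cross-modal retrieval (all names,
# thresholds, and data formats are assumed, not taken from the paper).
# Stage 1: coarse filter by appearance similarity of scene embeddings.
# Stage 2: rank surviving play sub-trajectories by how closely their 2D
# motion paths match the tracked hand path (DTW stands in for the
# paper's temporal behavioral similarity).
from math import dist


def dtw(path_a, path_b):
    """Dynamic-time-warping distance between two 2D point sequences."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(path_a[i - 1], path_b[j - 1])
            table[i][j] = cost + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[n][m]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0


def retrieve(hand_path, scene_embedding, play_segments, k=2, visual_threshold=0.8):
    """Return the k play sub-trajectories best matching a hand demonstration.

    play_segments: list of dicts with 'embedding' and 'path' keys (assumed).
    """
    # Stage 1: keep segments whose scene looks like the demonstration scene.
    candidates = [
        s for s in play_segments
        if cosine(s["embedding"], scene_embedding) >= visual_threshold
    ]
    # Stage 2: rank by temporal similarity of motion paths.
    candidates.sort(key=lambda s: dtw(s["path"], hand_path))
    return candidates[:k]
```

The retrieved sub-trajectories would then form the fine-tuning set for the policy; in this sketch, a segment whose scene embedding clears the threshold and whose path hugs the hand path is returned first.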
📝 Abstract
We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.