🤖 AI Summary
This work addresses the challenge of enabling a dual-arm robot with multi-fingered dexterous hands to efficiently learn complex bimanual coordination skills from a single unannotated human demonstration video. We propose DemoBot, a framework that extracts structured motion trajectories of both hands and the manipulated object from third-person RGB-D video; these trajectories serve as priors that guide reinforcement learning in fine-tuning contact-rich interactive behaviors. The approach combines temporally segmented reinforcement learning, a success-gated reset strategy, and an event-driven adaptive reward curriculum to handle the difficulties of long-horizon bimanual manipulation. Experiments demonstrate successful execution of both synchronous and asynchronous long-horizon bimanual assembly tasks, supporting the scalability and efficiency of transferring complex manipulation skills directly from human videos.
📝 Abstract
This work presents DemoBot, a learning framework that enables a dual-arm, multi-fingered robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that refines them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) temporal-segment-based RL, which enforces temporal alignment of the current state with the demonstration; (2) a Success-Gated Reset strategy, which balances the refinement of readily acquired skills against the exploration of subsequent task stages; and (3) an Event-Driven Reward curriculum with adaptive thresholding, which guides RL toward high-precision manipulation. The video processing and RL framework successfully completed long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.
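The abstract names the Success-Gated Reset strategy and adaptive thresholding but does not describe how they are implemented. The Python sketch below is only one plausible reading, not the authors' method. It assumes the demonstration is split into ordered segments, that a later segment becomes a reset point only once the preceding one succeeds often enough, and that a success tolerance is tightened as the success rate rises. The class `SuccessGatedReset`, the function `adapt_threshold`, and all parameter names and default values are hypothetical.

```python
import random
from collections import deque


class SuccessGatedReset:
    """Pick the demonstration segment an episode is reset to.

    Hypothetical sketch: segment k+1 is unlocked as a reset point only
    after segment k reaches `gate` success over a recent window.
    """

    def __init__(self, num_segments, gate=0.8, window=20, seed=0):
        self.gate = gate
        # One bounded success history per segment (1 = success, 0 = failure).
        self.history = [deque(maxlen=window) for _ in range(num_segments)]
        self.rng = random.Random(seed)

    def success_rate(self, segment):
        h = self.history[segment]
        return sum(h) / len(h) if h else 0.0

    def record(self, segment, success):
        self.history[segment].append(1 if success else 0)

    def unlocked(self):
        """Highest segment index currently allowed as a reset point."""
        k = 0
        while k + 1 < len(self.history) and self.success_rate(k) >= self.gate:
            k += 1
        return k

    def sample_start(self):
        # Uniform over unlocked segments: earlier ones keep being refined
        # while the newest one is explored.
        return self.rng.randint(0, self.unlocked())


def adapt_threshold(threshold, success_rate, target=0.7, shrink=0.9, floor=0.005):
    """Tighten the success tolerance once it is met often enough (assumed rule)."""
    if success_rate >= target:
        return max(floor, threshold * shrink)
    return threshold
```

In a training loop under these assumptions, `sample_start()` would be called at each environment reset, `record()` after each segment attempt, and `adapt_threshold()` periodically to shrink the tolerance that triggers a success event.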