DemoBot: Efficient Learning of Bimanual Manipulation with Dexterous Hands From Third-Person Human Videos

📅 2026-01-04

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how a dual-arm robot with multi-fingered dexterous hands can efficiently learn complex bimanual coordination skills from a single unannotated human demonstration video. It proposes DemoBot, a framework that extracts structured motion trajectories of both hands and the manipulated object from third-person RGB-D video and uses them as priors to guide reinforcement learning, which then refines contact-rich interactive behaviors. The approach combines temporally segmented reinforcement learning, a success-gated reset strategy, and an event-driven reward curriculum with adaptive thresholds to handle the difficulties of long-horizon bimanual manipulation. Experiments show successful execution of both synchronous and asynchronous long-horizon bimanual assembly tasks, supporting the claim that complex manipulation skills can be acquired directly from human videos in a scalable and efficient way.
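To make the idea of an event-driven reward with an adaptive threshold more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `AdaptiveEventReward`, every parameter, and all default values are assumptions chosen for illustration. The sketch pays a one-time bonus when an error (for example, the distance between two parts) falls below a tolerance, and tightens that tolerance once the policy triggers the event reliably.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveEventReward:
    """Sparse bonus that fires when an error drops below a tolerance.

    Hypothetical sketch: names and default values are illustrative
    assumptions, not taken from the DemoBot paper.
    """
    threshold: float = 0.05       # current tolerance, starts loose
    min_threshold: float = 0.005  # tightest tolerance the curriculum may reach
    shrink: float = 0.8           # multiplicative tightening factor
    target_rate: float = 0.7      # success rate that triggers tightening
    bonus: float = 1.0            # reward paid when the event fires

    def reward(self, error: float, already_fired: bool) -> float:
        """Pay the bonus only the first time the event occurs in an episode."""
        if not already_fired and error < self.threshold:
            return self.bonus
        return 0.0

    def update(self, success_rate: float) -> None:
        """Tighten the tolerance once the event is triggered reliably."""
        if success_rate >= self.target_rate:
            self.threshold = max(self.min_threshold,
                                 self.threshold * self.shrink)
```

In a training loop, `update` would typically be called between batches of episodes with the measured rate at which the event fired, so the required precision increases only as the policy improves.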

📝 Abstract

This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that learns to refine them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) temporal-segment-based RL to enforce temporal alignment of the current state with the demonstration; (2) a Success-Gated Reset strategy to balance the refinement of already acquired skills against the exploration of subsequent task stages; and (3) an Event-Driven Reward curriculum with adaptive thresholding to guide the learning of high-precision manipulation. The video processing and RL framework successfully completed long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.
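The abstract does not spell out how the Success-Gated Reset works, so the following Python sketch shows only one plausible reading, under stated assumptions. The class `SuccessGatedReset`, its parameters (`gate`, `window`, `refine_prob`), and the sampling rule are all hypothetical. The assumption is that the task is split into temporal segments, that episodes may start at the beginning of any segment already mastered, and that a segment counts as mastered once its recent success rate passes a gate.

```python
import random
from collections import deque


class SuccessGatedReset:
    """Chooses the segment at which the next episode starts.

    Hypothetical sketch: the gating rule and all defaults are
    illustrative assumptions, not the paper's algorithm.
    """

    def __init__(self, num_segments, gate=0.8, window=50,
                 refine_prob=0.3, seed=0):
        self.num_segments = num_segments
        self.gate = gate                # success rate needed to unlock the next segment
        self.refine_prob = refine_prob  # chance of revisiting a mastered segment
        self.history = [deque(maxlen=window) for _ in range(num_segments)]
        self.rng = random.Random(seed)

    def record(self, segment, success):
        """Store the outcome of one attempt at a segment."""
        self.history[segment].append(bool(success))

    def success_rate(self, segment):
        """Recent success rate of a segment, 0.0 if it was never attempted."""
        h = self.history[segment]
        return sum(h) / len(h) if h else 0.0

    def frontier(self):
        """Index of the first segment that has not yet passed the gate."""
        for k in range(self.num_segments):
            if self.success_rate(k) < self.gate:
                return k
        return self.num_segments - 1

    def sample_start(self):
        """Mostly explore the frontier, sometimes refine an earlier segment."""
        f = self.frontier()
        if f > 0 and self.rng.random() < self.refine_prob:
            return self.rng.randrange(f)
        return f
```

A usage pattern under the same assumptions: call `record` after each segment attempt, then call `sample_start` at every environment reset and restore the simulator to a state at the start of the returned segment. Raising `refine_prob` shifts effort toward polishing earlier segments, while lowering it favors exploring later stages.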

Problem

Research questions and friction points this paper is trying to address.

bimanual manipulation

dexterous hands

video-based learning

long-horizon tasks

human demonstration

Innovation

Methods, ideas, or system contributions that make the work stand out.

bimanual manipulation

dexterous hands

video-based imitation learning