DemoBot: Efficient Learning of Bimanual Manipulation with Dexterous Hands From Third-Person Human Videos

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling a dual-arm robot with multi-fingered dexterous hands to efficiently learn complex bimanual coordination skills from a single unannotated human demonstration video. DemoBot is a framework that extracts structured motion trajectories of both hands and the manipulated object from a third-person RGB-D video; these trajectories serve as priors that guide reinforcement learning as it fine-tunes contact-rich interactive behaviors. The approach combines temporally segmented reinforcement learning, a success-gated reset strategy, and an event-driven reward curriculum with adaptive thresholding to handle long-horizon bimanual manipulation. The experiments report successful execution of both synchronous and asynchronous long-horizon bimanual assembly tasks, which supports direct transfer of complex manipulation skills from human videos as a scalable approach.

📝 Abstract
This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that refines them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) temporal-segment-based RL to enforce temporal alignment of the current state with the demonstration; (2) a Success-Gated Reset strategy to balance the refinement of already acquired skills against the exploration of subsequent task stages; and (3) an Event-Driven Reward curriculum with adaptive thresholding to guide the learning of high-precision manipulation. The video processing and RL framework successfully completed long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.
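The abstract names a success-gated reset and an adaptive reward threshold but gives no implementation details. The Python sketch below is only one plausible reading of those two ideas, not the authors' method. Every name and number in it (SuccessGatedReset, AdaptiveThreshold, event_reward, the 0.8 gate, the 20-episode window, the 0.9 shrink factor) is a hypothetical choice made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SuccessGatedReset:
    """Hypothetical scheduler that picks which demo segment an episode starts in."""
    num_segments: int
    gate: float = 0.8         # assumed success rate needed to unlock the next segment
    window: int = 20          # assumed number of recent episodes tracked per segment
    refine_prob: float = 0.3  # assumed chance of revisiting an earlier segment
    unlocked: int = 1         # segments 0..unlocked-1 are valid reset points
    history: dict = field(default_factory=dict)

    def record(self, segment: int, success: bool) -> None:
        outcomes = self.history.setdefault(segment, [])
        outcomes.append(success)
        del outcomes[:-self.window]  # keep only the most recent outcomes
        frontier = self.unlocked - 1
        recent = self.history.get(frontier, [])
        gate_passed = (len(recent) == self.window
                       and sum(recent) / self.window >= self.gate)
        if gate_passed and self.unlocked < self.num_segments:
            self.unlocked += 1  # open the next task stage for exploration

    def sample_start(self, rng: random.Random) -> int:
        frontier = self.unlocked - 1
        if frontier > 0 and rng.random() < self.refine_prob:
            return rng.randrange(frontier)  # refine an already learned segment
        return frontier                     # explore the newest segment


@dataclass
class AdaptiveThreshold:
    """Hypothetical tolerance that tightens as the policy gets more reliable."""
    value: float              # current tolerance, e.g. a pose error bound
    floor: float              # tightest tolerance allowed
    shrink: float = 0.9       # assumed multiplicative tightening factor
    target_rate: float = 0.7  # assumed success rate that triggers tightening

    def update(self, success_rate: float) -> float:
        if success_rate >= self.target_rate:
            self.value = max(self.floor, self.value * self.shrink)
        return self.value


def event_reward(error: float, threshold: float, bonus: float = 1.0) -> float:
    """Pay a sparse bonus only when the event (error within threshold) occurs."""
    return bonus if error <= threshold else 0.0
```

In this reading, a training loop would call sample_start before each episode, report the outcome with record, and pass the current AdaptiveThreshold value to event_reward. The trade-off between refining earlier stages and exploring later ones is then controlled by refine_prob and gate.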
Problem

Research questions and friction points this paper is trying to address.

bimanual manipulation
dexterous hands
video-based learning
long-horizon tasks
human demonstration
Innovation

Methods, ideas, or system contributions that make the work stand out.

bimanual manipulation
dexterous hands
video-based imitation learning
reinforcement learning
motion priors
Yucheng Xu
ByteDance Seed
Xiaofeng Mao
Alibaba Group
Computer Vision, Adversarial Machine Learning
Elle Miller
ByteDance Seed
Xinyu Yi
ByteDance Seed
Yang Li
ByteDance Seed
Zhibin Li
Professor in School of Transportation, Southeast University
Intelligent Transportation System, Traffic Control, Traffic Safety, Traffic Flow, Data Mining
Robert B. Fisher
ByteDance Seed