H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Imitation learning for robotic manipulation is hindered by the scarcity of high-quality demonstration data and by the difficulty of cross-embodiment transfer. To address this, the authors propose H-RDT, a framework that pioneers the use of large-scale first-person human manipulation videos with paired 3D hand pose data for pretraining dual-arm robotic policies. In Stage I, a 2-billion-parameter diffusion transformer is pre-trained on this human data, using flow matching to model complex action distributions. In Stage II, modular action encoders and decoders enable efficient cross-embodiment fine-tuning on robot-specific data. Evaluated in both simulation and real-world settings, H-RDT achieves +13.9% and +40.5% gains, respectively, over training from scratch, and significantly outperforms baselines including Pi0 and RDT. Under few-shot conditions it demonstrates superior generalization and robustness, empirically validating the transfer of human behavioral priors to dual-arm robotic manipulation.

📝 Abstract
Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face significant limitations, as the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multi-task scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.
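The flow-matching objective mentioned in the abstract can be sketched in a few lines. This is a generic conditional flow-matching training step, not code from H-RDT: the model predicts the straight-line velocity from a noise sample to a ground-truth action chunk, and all names and dimensions below are illustrative assumptions.

```python
import numpy as np

def flow_matching_loss(model, actions, cond, rng):
    """One flow-matching training step (sketch).

    `model(x_t, t, cond)` predicts a velocity field; the regression target
    is the constant velocity of the linear path from noise to data.
    """
    noise = rng.normal(size=actions.shape)        # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))   # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions         # linear interpolant x_t
    v_target = actions - noise                    # target velocity x_1 - x_0
    v_pred = model(x_t, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

# Toy stand-in for the policy network: ignores conditioning, predicts zeros.
dummy_model = lambda x_t, t, cond: np.zeros_like(x_t)

rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 14))  # e.g. a batch of 14-DoF bimanual actions
loss = flow_matching_loss(dummy_model, actions, cond=None, rng=rng)
```

In the paper's setting the model would be the 2B-parameter diffusion transformer conditioned on visual observations; at inference, actions are generated by integrating the learned velocity field from noise to data.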
Problem

Research questions and friction points this paper is trying to address.

Scarcity of large-scale high-quality robot demonstration data
Challenges in unified training across diverse robot embodiments
How to leverage human manipulation data to enhance robot capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages human manipulation data for robots
Two-stage training with human and robot data
Uses diffusion transformer with flow matching
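The modular action encoder-decoder idea behind cross-embodiment fine-tuning can be sketched as thin per-embodiment adapters around a shared trunk. This is a minimal illustration under assumed dimensions and names, not the H-RDT implementation: only the small adapter matrices change per robot, while the shared backbone is reused.

```python
import numpy as np

LATENT_DIM = 32  # shared action-latent width (illustrative)

class EmbodimentAdapter:
    """Per-embodiment linear action encoder/decoder (sketch).

    Each embodiment maps its own action space into a shared latent space
    and back, so one backbone can serve both human-hand and robot data.
    """
    def __init__(self, action_dim, rng):
        self.enc = rng.normal(scale=0.1, size=(action_dim, LATENT_DIM))
        self.dec = rng.normal(scale=0.1, size=(LATENT_DIM, action_dim))

    def encode(self, actions):   # (B, action_dim) -> (B, LATENT_DIM)
        return actions @ self.enc

    def decode(self, latents):   # (B, LATENT_DIM) -> (B, action_dim)
        return latents @ self.dec

rng = np.random.default_rng(0)
shared_trunk = lambda z: np.tanh(z)  # stand-in for the shared 2B-param DiT

human = EmbodimentAdapter(action_dim=48, rng=rng)  # e.g. 3D hand pose params
robot = EmbodimentAdapter(action_dim=14, rng=rng)  # e.g. dual-arm joint space

a = rng.normal(size=(4, 14))
out = robot.decode(shared_trunk(robot.encode(a)))
```

Fine-tuning to a new robot then amounts to fitting (or replacing) one adapter pair while the pretrained trunk carries over the human behavioral priors.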
🔎 Similar Papers
No similar papers found.
Hongzhe Bi
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Lingxuan Wu
Tsinghua University
Embodied Intelligence, AI Safety
Tianwei Lin
Zhejiang University
MLLMs
Hengkai Tan
Tsinghua University
Reinforcement Learning, Robot Learning, Embodied AI, Deep Generative Models
Zhizhong Su
Horizon Robotics
Deep Learning, Computer Vision, Autonomous Driving, Robotics Learning
Hang Su
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Jun Zhu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University