DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos

📅 2026-02-10
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses dexterous bimanual robotic manipulation, which is hindered by the scarcity of real-world training data and by the embodiment gap between human demonstration videos and robotic systems. To bridge this gap, the authors propose DexImit, a framework that, for the first time, automatically synthesizes physically plausible bimanual robot manipulation data from monocular human demonstration videos captured from arbitrary viewpoints, without requiring additional annotations or manual intervention. DexImit operates through a four-stage pipeline: hand-object interaction reconstruction, subtask decomposition with bimanual coordination scheduling, robot trajectory synthesis, and data augmentation. This approach enables zero-shot deployment on diverse real-world tasks, including tool use, long-horizon sequences, and fine manipulation. Experiments demonstrate that the generated data can be used directly to train real robots, significantly improving their generalization.

📝 Abstract
Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
Problem

Research questions and friction points this paper is trying to address.

bimanual dexterous manipulation
data scarcity
embodiment gap
human video imitation
monocular video
Innovation

Methods, ideas, or system contributions that make the work stand out.

bimanual dexterous manipulation
monocular human videos
embodiment gap
trajectory synthesis
zero-shot deployment
👥 Authors
Juncheng Mu (Shanghai AI Laboratory)
Sizhe Yang (CUHK) · Embodied AI, Robotics
Yiming Bao (Tsinghua University)
Hojin Bae (Tsinghua University)
Tianming Wei (Tsinghua University) · Robotics
Linning Xu (Shanghai AI Laboratory)
Boyi Li (NVIDIA Research, UC Berkeley) · Multimodal Learning, Robotics, Computer Vision, Machine Learning
Huazhe Xu (Tsinghua University) · Embodied AI, Reinforcement Learning, Computer Vision, Deep Learning
Jiangmiao Pang (Shanghai AI Laboratory)