HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos

πŸ“… 2025-09-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of learning whole-body object-interaction control for humanoid robots from unconstrained monocular RGB videos, a problem hindered by scarce motion data and the difficulty of modeling dense physical contacts. The authors propose an end-to-end learning framework that unifies the object representation, introduces a residual action space, and couples them with a generalizable interaction reward function. The method integrates motion-trajectory extraction, motion retargeting, and sim-to-real transfer to jointly model robot and object states. To the authors' knowledge, this is the first approach enabling zero-shot deployment of whole-body interaction skills learned directly from unstructured video, with cross-scenario generalization. On a Unitree G1 robot, it executes 67 consecutive door-opening-and-traversal sequences and six distinct loco-manipulation tasks in real environments; in simulation, it generalizes across 14 diverse tasks. The results demonstrate strong robustness and generalization.

πŸ“ Abstract
Enabling robust whole-body humanoid-object interaction (HOI) remains challenging due to motion data scarcity and the contact-rich nature of such interactions. We present HDMI (HumanoiD iMitation for Interaction), a simple and general framework that learns whole-body humanoid-object interaction skills directly from monocular RGB videos. Our pipeline (i) extracts and retargets human and object trajectories from unconstrained videos to build structured motion datasets, (ii) trains a reinforcement learning (RL) policy to co-track robot and object states with three key designs: a unified object representation, a residual action space, and a general interaction reward, and (iii) zero-shot deploys the RL policies on real humanoid robots. Extensive sim-to-real experiments on a Unitree G1 humanoid demonstrate the robustness and generality of our approach: HDMI achieves 67 consecutive door traversals and successfully performs 6 distinct loco-manipulation tasks in the real world and 14 tasks in simulation. Our results establish HDMI as a simple and general framework for acquiring interactive humanoid skills from human videos.
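The residual action space mentioned in the abstract can be illustrated with a minimal sketch: the policy does not output joint targets from scratch, but a bounded correction added to the retargeted reference motion. The function name, clip bounds, and scale factor below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def residual_action(policy_residual, reference_action, scale=0.3):
    """Compose a joint-position target from a retargeted reference
    action and a bounded learned residual (hypothetical formulation;
    HDMI's exact bounds and scaling are not specified here)."""
    # Clip the raw policy output so it can only deviate from the
    # reference trajectory by a bounded amount, which keeps early
    # RL exploration close to the demonstrated motion.
    residual = np.clip(policy_residual, -1.0, 1.0) * scale
    return reference_action + residual

# Toy example: reference joint targets from a retargeted human motion,
# plus a small learned correction from the policy.
reference = np.array([0.1, -0.4, 0.25])
raw_residual = np.array([2.0, -0.5, 0.0])  # pre-clip policy output
target = residual_action(raw_residual, reference)
print(target)  # -> [ 0.4  -0.55  0.25]
```

Anchoring actions to the reference in this way is a common trick in motion-imitation RL: it shrinks the effective search space while still letting the policy correct for contact dynamics the reference does not capture.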
Problem

Research questions and friction points this paper is trying to address.

Learning whole-body humanoid-object interaction skills from monocular RGB videos
Addressing motion data scarcity and the contact-rich nature of humanoid-object interactions
Developing a framework for zero-shot deployment of policies on real robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns whole-body interaction from RGB videos
Uses reinforcement learning with a unified object representation and a residual action space
Zero-shot deployment on real humanoid robots