AI Summary
This work addresses the spatial misalignment commonly encountered by humanoid robots in interactive tasks, which arises from discrepancies between human pose estimates and the robot's own morphology, causing conventional retargeting methods to fail due to skeletal scale mismatches. To overcome this, the authors propose Dream2Act, a novel framework that achieves zero-shot, retargeting-free full-body interaction for the first time. By leveraging a robot-centric generative video model, the approach directly synthesizes native-compliant motions from third-person images, integrating high-fidelity pose extraction with a universal whole-body controller to produce physically feasible joint trajectories within the robot's intrinsic coordinate system. Evaluated on the Unitree G1 platform across four locomotion-interaction tasks, the method attains an overall success rate of 37.5%, substantially outperforming traditional approaches (0%) and enabling reliable physical contact.
Abstract
Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap: skeletal scale mismatches result in severe spatial misalignments when mapped to robots, compromising interaction success. In this work, we propose Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis. Given a third-person image of the robot and target object, our framework leverages video generation models to envision the robot completing the task with morphology-consistent motion. We employ a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from these synthesized dreams, which are then executed via a general-purpose whole-body controller. Operating strictly within the robot-native coordinate space, Dream2Act avoids retargeting errors and eliminates task-specific policy training. We evaluate Dream2Act on the Unitree G1 across four whole-body mobile interaction tasks: ball kicking, sofa sitting, bag punching, and box hugging. Dream2Act achieves a 37.5% overall success rate, compared to 0% for conventional retargeting. While retargeting fails to establish correct physical contacts due to the morphology gap (with errors compounded during locomotion), Dream2Act maintains robot-consistent spatial alignment, enabling reliable contact formation and substantially higher task completion.
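The abstract describes a three-stage pipeline: dream (video synthesis), extract (robot-native pose recovery), and execute (whole-body control). The sketch below is a minimal, hypothetical illustration of that data flow; every function, class, and parameter name here is an assumption for readability, not the authors' actual API, and each stage is stubbed rather than implemented.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of the Dream2Act pipeline stages named in the abstract.
# All identifiers are illustrative assumptions; the stages are stubs.

@dataclass
class JointTrajectory:
    timestamps: List[float]             # seconds
    joint_positions: List[List[float]]  # one robot-native pose per timestamp

def dream(image_path: str, task_prompt: str) -> List[str]:
    """Stage 1: a video generation model 'dreams' the robot completing the
    task from a third-person image (stubbed as frame identifiers)."""
    return [f"frame_{i:03d}" for i in range(4)]

def extract_poses(frames: List[str], n_joints: int = 29) -> JointTrajectory:
    """Stage 2: pose extraction recovers physically feasible joint
    trajectories in the robot's own coordinate frame (stubbed with zeros)."""
    dt = 1.0 / 30.0  # assume 30 fps synthesized video
    return JointTrajectory(
        timestamps=[i * dt for i in range(len(frames))],
        joint_positions=[[0.0] * n_joints for _ in frames],
    )

def execute(traj: JointTrajectory) -> bool:
    """Stage 3: a general-purpose whole-body controller tracks the
    trajectory (stubbed to check the trajectory is non-empty and consistent)."""
    return 0 < len(traj.timestamps) == len(traj.joint_positions)

frames = dream("third_person_view.png", "kick the ball")
traj = extract_poses(frames)
ok = execute(traj)
```

Because every stage operates on robot-native joint trajectories, no human-to-robot retargeting step appears anywhere in the flow, which is the structural point the abstract makes.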