🤖 AI Summary
This work addresses the limited robustness of motion planning and manipulation skills for humanoid robots performing long-horizon, highly interactive tasks. The authors propose an end-to-end trainable guided diffusion prior that integrates multimodal human and robot motion data within a unified embodiment space to generate high-quality reference trajectories directly in the robot's motion space. This approach enables automated synthesis of diverse motor skills without manual filtering and effectively supports downstream reinforcement learning policies. Experiments in simulation and on a real Unitree G1 humanoid robot demonstrate that the proposed method significantly improves policy robustness and broadens the repertoire of executable skills.
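To make the described pipeline concrete, below is a minimal sketch of sampling a joint-space reference trajectory from a guided diffusion prior. All names (`MotionDenoiser`, `task_cost`, the trajectory dimensions) and the simplified DDPM-style sampler are assumptions for illustration, not the authors' implementation; the guidance term here is a generic gradient of a task cost applied to the denoised estimate.

```python
# Hypothetical sketch: sampling a reference trajectory from a guided diffusion
# prior over robot joint-space motion. Not the paper's actual architecture.
import torch
import torch.nn as nn

NUM_JOINTS, HORIZON, STEPS = 29, 64, 50   # assumed trajectory shape and diffusion steps

class MotionDenoiser(nn.Module):
    """Predicts the noise added to a trajectory, conditioned on a task embedding."""
    def __init__(self, cond_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_JOINTS * HORIZON + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, NUM_JOINTS * HORIZON),
        )

    def forward(self, x, t, cond):
        flat = torch.cat([x.flatten(1), cond, t[:, None].float() / STEPS], dim=-1)
        return self.net(flat).view_as(x)

def task_cost(traj):
    """Hypothetical guidance cost, e.g. penalizing deviation of the final pose."""
    return traj[:, -1, :3].pow(2).sum()

@torch.no_grad()
def sample_reference(model, cond, guidance_scale=1.0):
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(1, HORIZON, NUM_JOINTS)          # start from pure noise
    for t in reversed(range(STEPS)):
        t_batch = torch.full((1,), t)
        eps = model(x, t_batch, cond)
        # Estimate the clean trajectory, then nudge it along the guidance gradient.
        x0 = (x - (1 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
        with torch.enable_grad():
            x0_g = x0.detach().requires_grad_(True)
            grad = torch.autograd.grad(task_cost(x0_g), x0_g)[0]
        x0 = x0 - guidance_scale * grad
        # Simplified reverse step toward the guided clean estimate.
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        alpha_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
        x = alpha_prev.sqrt() * x0 + (1 - alpha_prev).sqrt() * noise
    return x   # reference trajectory handed to the downstream RL policy

model = MotionDenoiser()                              # untrained, for shape checking only
reference = sample_reference(model, cond=torch.zeros(1, 32))
print(reference.shape)                                # torch.Size([1, 64, 29])
```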
📝 Abstract
Developing robust autonomous loco-manipulation skills for humanoids remains an open problem in robotics. While reinforcement learning (RL) has been applied successfully to legged locomotion, applying it to complex, interaction-rich manipulation tasks is harder due to the long-horizon planning such tasks require. A recent approach along these lines is DreamControl, which addresses these issues by leveraging off-the-shelf human motion diffusion models as a generative prior to guide RL policies during training. In this paper, we investigate the impact of DreamControl's motion prior and propose an improved framework that trains a guided diffusion model directly in the humanoid robot's motion space, aggregating diverse human and robot datasets into a unified embodiment space. We demonstrate that our approach captures a wider range of skills thanks to the larger training-data mixture and establishes a more automated pipeline by removing the need for manual filtering. Furthermore, we show that scaling the generation of reference trajectories is important for achieving robust downstream RL policies. We validate our approach through extensive experiments in simulation and on a real Unitree G1.
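The abstract states that generated reference trajectories guide downstream RL policies and that scaling their generation matters for robustness. Below is a hedged sketch of one common way such references can enter RL training, via a per-step tracking reward; the reward form, `sigma`, and the resampling loop are assumptions, and the random arrays stand in for real diffusion samples and robot states.

```python
# Hypothetical sketch: using diffusion-generated references as an RL tracking reward.
import numpy as np

def tracking_reward(joint_pos, ref_joint_pos, sigma=0.25):
    """Exponential tracking term: larger when the policy's joint positions stay
    close to the reference pose for the current timestep."""
    err = np.sum((joint_pos - ref_joint_pos) ** 2)
    return float(np.exp(-err / (2 * sigma ** 2)))

# Each episode draws a fresh reference from the prior, so scaling reference
# generation directly scales the motion diversity the RL policy is exposed to.
references = [np.random.randn(64, 29) * 0.1 for _ in range(4)]   # stand-in samples
for ref in references:
    episode_return = 0.0
    for ref_q in ref:
        q = ref_q + np.random.randn(29) * 0.05                   # stand-in robot state
        episode_return += tracking_reward(q, ref_q)
    print(f"episode tracking return: {episode_return:.2f}")
```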