🤖 AI Summary
To address the challenge of generalizing interactive behaviors—such as object manipulation and sit-to-stand transitions—for humanoid robots in real-world environments, this paper proposes a unified simulation-to-reality (Sim2Real) interactive architecture. Methodologically, it innovatively integrates adversarial motion-prior policy learning with LiDAR-camera coarse-to-fine multimodal localization, enabling natural motion generation and robust scene perception; action optimization is further enhanced via reinforcement learning and Sim2Real policy transfer to improve cross-scenario generalization. Evaluated on four interactive tasks, the approach achieves high success rates in both simulation and real-robot deployment, significantly outperforming baselines: motions are more human-like, localization error is reduced by 32%, and task generalization improves by 41%. This work constitutes the first framework to holistically integrate motion-prior modeling, continuous multimodal perception, and Sim2Real policy transfer, providing a scalable foundation for natural and robust embodied interaction.
📝 Abstract
Deploying humanoid robots to interact with real-world environments--such as carrying objects or sitting on chairs--requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks--box carrying, sitting, lying, and standing up--in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.