🤖 AI Summary
This work addresses the challenges of spatial understanding and limited action generalization faced by humanoid robots in complex 3D environments during whole-body manipulation. To overcome these issues, the authors propose a generalizable motion-manipulation framework that integrates the spatial perception and action generation capabilities of multi-agent large models for the first time. The framework features an “active spatial brain” for scene understanding and task planning, coupled with a “universal motor cerebellum” that generates executable actions without requiring task-specific real-world data. Experimental results demonstrate that the proposed approach significantly enhances spatial reasoning across diverse manipulation tasks and achieves efficient, generalizable whole-body manipulation performance on physical humanoid robots.
📝 Abstract
In this paper, we explore spatial-aware humanoid whole-body manipulation task. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real-robot data restricts data-driven models generalization. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision-making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generate executable robot actions based on the decisions made by the first module without needs of task-specific real robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.