Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two challenges in humanoid whole-body manipulation: spatial understanding in complex 3D environments and limited action generalization. The authors propose a generalizable loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. The framework pairs an Active Spatial Brain, which handles active scene perception, task planning, and subtask decomposition, with a Generalizable Action Cerebellum, which generates executable robot actions without task-specific real-robot data. On a set of spatial manipulation tasks, the authors report strong results in both spatial perception and understanding and real-robot task performance across diverse tasks and environments.
📝 Abstract
In this paper, we explore the task of spatially aware humanoid whole-body manipulation. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is difficult in complex 3D environments with diverse spatial relations. 2) Action generation is hard to generalize, because limited and costly real-robot data restricts the generalization of data-driven models. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: the Active Spatial Brain for active spatial perception and decision-making, and the Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generates executable robot actions based on the decisions made by the first component, without needing task-specific real-robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.
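The two-component split described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: every class, method, and skill name below (`SpatialBrain`, `ActionCerebellum`, `Subtask`, `"search"`, `"walk_to"`, `"grasp"`) is a hypothetical stand-in, and the large-model calls are replaced by simple stub logic. It only illustrates one way a planner that decomposes a task could hand subtasks to a separate action generator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtask:
    """One step of a decomposed task (hypothetical structure)."""
    skill: str   # e.g. "walk_to" or "grasp"
    target: str  # object or location the skill applies to


@dataclass
class Action:
    """Placeholder for an executable robot command (hypothetical)."""
    command: str


class SpatialBrain:
    """Stand-in planner: observes the scene, then decomposes the task."""

    def __init__(self, perceive: Callable[[], List[str]]):
        # perceive() returns the names of currently visible objects.
        self.perceive = perceive

    def plan(self, target: str) -> List[Subtask]:
        visible = self.perceive()
        steps: List[Subtask] = []
        if target not in visible:
            # Stub for active perception: search before approaching.
            steps.append(Subtask("search", target))
        steps += [Subtask("walk_to", target), Subtask("grasp", target)]
        return steps


class ActionCerebellum:
    """Stand-in action generator: maps each subtask to one command."""

    def act(self, subtask: Subtask) -> Action:
        return Action(f"{subtask.skill}({subtask.target})")


def run(brain: SpatialBrain, cerebellum: ActionCerebellum, target: str) -> List[Action]:
    """Plan with the brain, then turn each subtask into an action."""
    return [cerebellum.act(s) for s in brain.plan(target)]


# The cup is not in view, so the plan starts with a search step.
brain = SpatialBrain(perceive=lambda: ["table", "chair"])
commands = [a.command for a in run(brain, ActionCerebellum(), "cup")]
print(commands)  # ['search(cup)', 'walk_to(cup)', 'grasp(cup)']
```

Keeping the planner and the action generator behind separate interfaces mirrors the abstract's claim that the second component acts only on decisions made by the first, so either side could be swapped without touching the other.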
Problem

Research questions and friction points this paper is trying to address.

humanoid whole-body manipulation
spatial understanding
action generalization
3D environments
real-robot data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

humanoid whole-body manipulation
spatial perception
generalizable action generation
multi-agent large models
loco-manipulation
Zhizhao Liang
School of Computer Science and Engineering, Sun Yat-sen University
Yi-Lin Wei
Sun Yat-sen University
Xuhang Chen
Huizhou University
computational imaging, low-level vision, computational photography
Mu Lin
School of Computer Science and Engineering, Sun Yat-sen University
Yi-Xiang He
School of Computer Science and Engineering, Sun Yat-sen University
Zhexi Luo
School of Computer Science and Engineering, Sun Yat-sen University
Jun-Hui Liu
School of Computer Science and Engineering, Sun Yat-sen University
Kun-Yu Lin
The University of Hong Kong
Computer Vision, Machine Learning
Wei-Shi Zheng
Professor, Sun Yat-sen University
Computer Vision, Pattern Recognition, Machine Learning