AI Summary
To enable humanoid robots to reliably execute complex, multi-step manipulation tasks in industrial and domestic settings, this paper proposes a vision-language-driven hierarchical planning and control framework. The framework integrates pretrained vision-language models for high-level skill planning and semantic state perception, combines imitation-learning-based mid-level skill policies with reinforcement-learning-based low-level controllers, and incorporates a real-time state-monitoring mechanism that supports dynamic task rescheduling and fault-tolerant control. Its key innovation is the first deep integration of multimodal semantic understanding into the hierarchical control architecture of a humanoid robot, which significantly improves task robustness and generalization. Evaluated on the Unitree G1 platform over 40 real-world trials, the method achieves a 72.5% end-to-end task success rate, demonstrating its effectiveness and practicality on challenging non-prehensile manipulation tasks such as contact-based object transport.
Abstract
Enabling humanoid robots to reliably execute complex multi-step manipulation tasks is crucial for their effective deployment in industrial and household environments. This paper presents a hierarchical planning and control framework designed to achieve reliable multi-step humanoid manipulation. The proposed system comprises three layers: (1) a low-level RL-based controller responsible for tracking whole-body motion targets; (2) a mid-level set of skill policies, trained via imitation learning, that produce motion targets for the different steps of a task; and (3) a high-level vision-language planning module that determines which skills should be executed and monitors their completion in real time using pretrained vision-language models (VLMs). Experimental validation is performed on a Unitree G1 humanoid robot executing a non-prehensile pick-and-place task. Over 40 real-world trials, the hierarchical system achieved a 72.5% success rate in completing the full manipulation sequence. These experiments confirm the feasibility of the proposed hierarchical system, highlighting the benefits of VLM-based skill planning and monitoring for multi-step manipulation scenarios. See https://vlp-humanoid.github.io/ for video demonstrations of policy rollouts.
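The three-layer control loop described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: every function name (`vlm_plan`, `skill_policy`, `rl_track`, `vlm_monitor`), the skill names, and the simulated completion logic are assumptions standing in for the VLM planner, the imitation-learned skill policies, and the RL tracking controller.

```python
# Hypothetical sketch of the paper's hierarchy: a high-level VLM planner
# selects a skill sequence, mid-level skill policies emit motion targets,
# a low-level RL controller tracks them, and a VLM monitor decides when
# each skill is complete (triggering progression or, in the paper,
# rescheduling). All names and behaviors here are illustrative stubs.
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    steps: int  # stand-in for "completion after N control steps"


def vlm_plan(task: str) -> list[Skill]:
    """Stand-in for high-level VLM planning: task text -> skill sequence."""
    return [Skill("approach", 2), Skill("push", 3), Skill("retreat", 1)]


def skill_policy(skill: Skill, t: int) -> dict:
    """Stand-in for a mid-level imitation-learned policy producing a
    whole-body motion target for the current step."""
    return {"skill": skill.name, "target": f"pose_{t}"}


def rl_track(target: dict) -> None:
    """Stand-in for the low-level RL controller tracking the target."""
    pass  # in reality: joint-level whole-body control


def vlm_monitor(skill: Skill, t: int) -> bool:
    """Stand-in for VLM-based completion monitoring from camera images."""
    return t >= skill.steps


def run(task: str) -> list[str]:
    """Execute the planned skills, advancing only when the monitor
    declares the current skill complete."""
    executed = []
    for skill in vlm_plan(task):
        t = 0
        while not vlm_monitor(skill, t):
            rl_track(skill_policy(skill, t))
            t += 1
        executed.append(skill.name)
    return executed
```

The key design point the sketch captures is the separation of concerns: the planner and monitor reason over semantics, while the skill policies and RL controller handle continuous motion, so a failed completion check can trigger replanning without retraining the lower layers.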