AI Summary
To enable humanoid robots to reliably execute complex, multi-step manipulation tasks in industrial and domestic settings, this paper proposes a vision-language-driven hierarchical planning and control framework. The framework integrates pretrained vision-language models for high-level skill planning and semantic state perception, combines imitation-learning-based mid-level skill policies with reinforcement-learning-based low-level controllers, and incorporates a real-time state-monitoring mechanism that supports dynamic task rescheduling and fault-tolerant control. Its key innovation is the first deep integration of multimodal semantic understanding into the hierarchical control architecture of a humanoid robot, which significantly improves task robustness and generalization. Evaluated on the Unitree G1 platform over 40 real-world trials, the method achieves a 72.5% end-to-end task success rate, demonstrating its effectiveness and practicality on challenging non-prehensile manipulation tasks such as contact-based object transport.
Abstract
Enabling humanoid robots to reliably execute complex multi-step manipulation tasks is crucial for their effective deployment in industrial and household environments. This paper presents a hierarchical planning and control framework designed to achieve reliable multi-step humanoid manipulation. The proposed system comprises three layers: (1) a low-level RL-based controller responsible for tracking whole-body motion targets; (2) a mid-level set of skill policies, trained via imitation learning, that produce motion targets for the different steps of a task; and (3) a high-level vision-language planning module that determines which skills should be executed and monitors their completion in real time using pretrained vision-language models (VLMs). Experimental validation is performed on a Unitree G1 humanoid robot executing a non-prehensile pick-and-place task. Over 40 real-world trials, the hierarchical system achieved a 72.5% success rate in completing the full manipulation sequence. These experiments confirm the feasibility of the proposed hierarchical system, highlighting the benefits of VLM-based skill planning and monitoring for multi-step manipulation scenarios. See https://vlp-humanoid.github.io/ for video demonstrations of policy rollouts.
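The three-layer control loop described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: every function name (`vlm_plan`, `skill_policy`, `rl_track`, `vlm_monitor`), the skill names, and the simulated completion logic are assumptions standing in for the VLM planner, the imitation-learned skill policies, and the RL tracking controller.

```python
# Hypothetical sketch of the paper's hierarchy: a high-level VLM planner
# selects a skill sequence, mid-level skill policies emit motion targets,
# a low-level RL controller tracks them, and a VLM monitor decides when
# each skill is complete (triggering progression or, in the paper,
# rescheduling). All names and behaviors here are illustrative stubs.
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    steps: int  # stand-in for "completion after N control steps"


def vlm_plan(task: str) -> list[Skill]:
    """Stand-in for high-level VLM planning: task text -> skill sequence."""
    return [Skill("approach", 2), Skill("push", 3), Skill("retreat", 1)]


def skill_policy(skill: Skill, t: int) -> dict:
    """Stand-in for a mid-level imitation-learned policy producing a
    whole-body motion target for the current step."""
    return {"skill": skill.name, "target": f"pose_{t}"}


def rl_track(target: dict) -> None:
    """Stand-in for the low-level RL controller tracking the target."""
    pass  # in reality: joint-level whole-body control


def vlm_monitor(skill: Skill, t: int) -> bool:
    """Stand-in for VLM-based completion monitoring from camera images."""
    return t >= skill.steps


def run(task: str) -> list[str]:
    """Execute the planned skills, advancing only when the monitor
    declares the current skill complete."""
    executed = []
    for skill in vlm_plan(task):
        t = 0
        while not vlm_monitor(skill, t):
            rl_track(skill_policy(skill, t))
            t += 1
        executed.append(skill.name)
    return executed
```

The key design point the sketch captures is the separation of concerns: the planner and monitor reason over semantics, while the skill policies and RL controller handle continuous motion, so a failed completion check can trigger replanning without retraining the lower layers.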