MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of single-policy approaches to humanoid loco-manipulation, which often produce unnatural motions, poor stability, and weak compositional generalization. The authors propose a hierarchical world model framework that trains multiple expert policies, each infused with human motion priors, via imitation-constrained reinforcement learning. A vision-language model (VLM)-driven routing mechanism then composes these experts dynamically according to task semantics. According to the authors, this is the first approach to achieve semantic-aware adaptive policy scheduling; it is reported to improve motion naturalness, stability, and cross-task compositional generalization while mitigating gradient interference and motion-mode conflicts among diverse skills.

📝 Abstract
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, the generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, which introduces biomechanically consistent inductive biases that favor natural and physically plausible motion. Building on this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantics-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
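The abstract does not specify how the router combines expert outputs, so the following is only an illustrative sketch of one common choice: a softmax-weighted blend of per-expert action vectors. The function names (`softmax`, `compose_action`), the example experts `walk` and `reach`, and the idea that the router emits one scalar score per expert are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(scores):
    """Convert raw router scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_action(expert_actions, router_scores):
    """Blend expert action vectors with softmax routing weights.

    expert_actions: list of equal-length action vectors, one per expert
                    (hypothetical stand-ins for Specialized Expert Policies).
    router_scores:  one raw score per expert (hypothetical stand-in for
                    the output of a VLM-supervised router).
    Returns (blended_action, weights).
    """
    if len(expert_actions) != len(router_scores):
        raise ValueError("need exactly one router score per expert")
    weights = softmax(router_scores)
    dim = len(expert_actions[0])
    blended = [0.0] * dim
    for w, action in zip(weights, expert_actions):
        if len(action) != dim:
            raise ValueError("all expert actions must have the same length")
        for i, a in enumerate(action):
            blended[i] += w * a
    return blended, weights


# Hypothetical 3-dimensional actions from two experts.
walk = [0.2, -0.1, 0.0]
reach = [0.0, 0.3, 0.5]

# A higher score for the first expert pulls the blend toward `walk`.
action, weights = compose_action([walk, reach], [2.0, 0.0])
```

A soft blend like this lets the controller transition smoothly between skills in a multi-stage task; a hard selection (picking only the top-scoring expert) would be an equally plausible reading of "dynamic integration".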
Problem

Research questions and friction points this paper is trying to address.

humanoid locomotion
manipulation
compositional generalization
whole-body control
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical World Modeling
Specialized Expert Policies
Vision-Language Model
Imitation-Constrained Reinforcement Learning
Compositional Generalization
Yutong Shen
School of Information Science and Technology, Beijing University of Technology, Beijing, China
Hangxu Liu
School of Information Science and Engineering, Fudan University, Shanghai, China
Penghui Liu
School of Information Science and Technology, Beijing University of Technology, Beijing, China
Jiashuo Luo
School of Information Science and Technology, Beijing University of Technology, Beijing, China
Yongkang Zhang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Cloud Computing, GPU Compiler, GPU Virtualization
Rex Morvley
School of Information Science and Technology, Beijing University of Technology, Beijing, China
Chen Jiang
University of Alberta
Computer Vision, Deep Learning, Robotics
Jianwei Zhang
University of Hamburg, Hamburg, Germany
Lei Zhang
University of Hamburg, Agile Robots SE
Dexterous Manipulation, Multi-modal AI, Embodied AI