MetaWorld: Skill Transfer and Composition in a Hierarchical World Model for Grounding High-Level Instructions

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of bridging the gap between semantic understanding and physical execution in humanoid robot manipulation, where existing approaches suffer from limited sample efficiency, poor generalization, and insufficient physical consistency. The authors propose a hierarchical world model comprising a high-level visual-language model (VLM) that interprets abstract instructions and guides semantic decision-making, and a low-level dynamics model operating in a compact latent state space that leverages a pretrained library of expert policies for efficient skill invocation. A novel dynamic expert selection mechanism, integrated with motion priors, enables an end-to-end mapping from semantics to actions, effectively circumventing the symbol grounding problem. Experiments on Humanoid-Bench demonstrate that the proposed method significantly outperforms current world-model-based reinforcement learning approaches in both task success rate and motion coherence.
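The summary above describes a two-level design: a VLM-driven semantic layer that maps an instruction to a skill, and a low-level model in a compact latent space that invokes policies from a pretrained expert library via dynamic expert selection. The paper does not publish implementation details here, so the following is only a minimal Python sketch of that selection idea; every name, the embedding representation, and the cosine-similarity scoring rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

LATENT_DIM = 8  # assumed size of the compact latent state space

# Hypothetical pretrained expert library: each expert is represented by a
# latent "skill embedding" and a trivial linear policy over the latent state.
rng = np.random.default_rng(0)
EXPERTS = {
    "walk":  {"embedding": rng.normal(size=LATENT_DIM), "policy": rng.normal(size=(4, LATENT_DIM))},
    "reach": {"embedding": rng.normal(size=LATENT_DIM), "policy": rng.normal(size=(4, LATENT_DIM))},
    "grasp": {"embedding": rng.normal(size=LATENT_DIM), "policy": rng.normal(size=(4, LATENT_DIM))},
}

def semantic_layer(instruction: str) -> np.ndarray:
    """Stand-in for the VLM: map an instruction to a target skill embedding."""
    for skill, expert in EXPERTS.items():
        if skill in instruction:
            return expert["embedding"]
    # Unknown instruction: fall back to the mean embedding of the library.
    return np.mean([e["embedding"] for e in EXPERTS.values()], axis=0)

def select_expert(target: np.ndarray) -> str:
    """Pick the expert whose skill embedding best matches the target (cosine)."""
    def score(name: str) -> float:
        v = EXPERTS[name]["embedding"]
        return float(v @ target / (np.linalg.norm(v) * np.linalg.norm(target)))
    return max(EXPERTS, key=score)

def act(instruction: str, latent_state: np.ndarray) -> np.ndarray:
    """End-to-end: instruction -> selected expert -> action in latent space."""
    expert = select_expert(semantic_layer(instruction))
    return EXPERTS[expert]["policy"] @ latent_state

state = rng.normal(size=LATENT_DIM)
action = act("reach for the cup", state)
print(action.shape)  # (4,)
```

In the actual system the semantic layer would be a learned VLM and the scoring would involve the latent dynamics model and motion priors; the point of the sketch is only the mapping chain instruction → skill → expert policy → action.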

📝 Abstract
Humanoid robot loco-manipulation remains constrained by the semantic-physical gap. Current methods face three limitations: low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in VLMs. We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM-driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion prior fusion mechanism leverages a pre-trained multi-expert policy library as transferable knowledge, enabling efficient online adaptation via a two-stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing symbol grounding. Experiments on Humanoid-Bench show MetaWorld outperforms world model-based RL in task completion and motion coherence. Our code can be found at https://anonymous.4open.science/r/metaworld-2BF4/
Problem

Research questions and friction points this paper is trying to address.

semantic-physical gap
humanoid loco-manipulation
sample efficiency
generalization
physical inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical world model
expert policy transfer
vision-language models
skill composition
latent dynamics
Yutong Shen
Beijing University of Technology
Hangxu Liu
Fudan University
Kailin Pei
Beijing University of Technology
Ruizhe Xia
Beijing University of Technology
Tongtong Feng
Tsinghua University
Environment Learning · Autonomous Embodied AI · Multimedia Intelligence