🤖 AI Summary
This work addresses the challenge of bridging the gap between semantic understanding and physical execution in humanoid robot manipulation, where existing approaches suffer from limited sample efficiency, poor generalization, and insufficient physical consistency. The authors propose a hierarchical world model comprising a high-level visual-language model (VLM) that interprets abstract instructions and guides semantic decision-making, and a low-level dynamics model operating in a compact latent state space that leverages a pretrained library of expert policies for efficient skill invocation. A novel dynamic expert selection mechanism, integrated with motion priors, enables an end-to-end mapping from semantics to actions, effectively circumventing the symbol grounding problem. Experiments on Humanoid-Bench demonstrate that the proposed method significantly outperforms current world-model-based reinforcement learning approaches in both task success rate and motion coherence.
📝 Abstract
Humanoid robot loco-manipulation remains constrained by the semantic-physical gap. Current methods face three limitations: low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in VLMs. We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM-driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion-prior fusion mechanism leverages a pretrained multi-expert policy library as transferable knowledge, enabling efficient online adaptation via a two-stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing the symbol grounding problem. Experiments on Humanoid-Bench show MetaWorld outperforms world-model-based RL methods in task completion and motion coherence. Our code can be found at https://anonymous.4open.science/r/metaworld-2BF4/
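The dynamic expert selection described above can be pictured as a gating step: the VLM maps an instruction to a skill embedding, and each pretrained expert in the policy library is scored against that embedding to produce selection weights. The sketch below is a minimal, hypothetical illustration of such a gating mechanism, not the paper's implementation; the names `select_expert`, `expert_keys`, and `skill_embedding`, the dot-product scoring, and the softmax temperature are all assumptions for illustration.

```python
import numpy as np

def select_expert(skill_embedding, expert_keys, temperature=1.0):
    """Hypothetical gating: score each pretrained expert's key vector
    against the VLM-produced skill embedding and return a softmax
    distribution used to pick (or blend) experts."""
    logits = expert_keys @ skill_embedding / temperature
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy example: a library of 3 experts with 4-dim key vectors.
rng = np.random.default_rng(0)
expert_keys = rng.normal(size=(3, 4))      # one key per pretrained expert
skill_embedding = rng.normal(size=4)       # stand-in for the VLM output
weights = select_expert(skill_embedding, expert_keys)
chosen = int(np.argmax(weights))           # hard selection of one expert
```

In practice a mechanism like this could either commit to the top-scoring expert or mix expert actions by the soft weights, which is where the motion-prior fusion in the abstract would enter.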