🤖 AI Summary
This work addresses the challenge of bridging the gap between semantic understanding and physical execution in humanoid robot manipulation, where existing approaches suffer from limited sample efficiency, poor generalization, and insufficient physical consistency. The authors propose a hierarchical world model comprising a high-level visual-language model (VLM) that interprets abstract instructions and guides semantic decision-making, and a low-level dynamics model operating in a compact latent state space that leverages a pretrained library of expert policies for efficient skill invocation. A novel dynamic expert selection mechanism, integrated with motion priors, enables an end-to-end mapping from semantics to actions, effectively circumventing the symbol grounding problem. Experiments on Humanoid-Bench demonstrate that the proposed method significantly outperforms current world-model-based reinforcement learning approaches in both task success rate and motion coherence.
📝 Abstract
Humanoid robot loco-manipulation remains constrained by the semantic-physical gap. Current methods face three limitations: low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in VLMs. We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM-driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion-prior fusion mechanism leverages a pretrained multi-expert policy library as transferable knowledge, enabling efficient online adaptation via a two-stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing the symbol grounding problem. Experiments on Humanoid-Bench show MetaWorld outperforms world-model-based RL methods in task completion and motion coherence. Our code can be found at https://anonymous.4open.science/r/metaworld-2BF4/
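The dynamic expert selection described above can be pictured as a gating step: the VLM maps an instruction to a skill embedding, and each pretrained expert in the policy library is scored against that embedding to produce selection weights. The sketch below is a minimal, hypothetical illustration of such a gating mechanism, not the paper's implementation; the names `select_expert`, `expert_keys`, and `skill_embedding`, the dot-product scoring, and the softmax temperature are all assumptions for illustration.

```python
import numpy as np

def select_expert(skill_embedding, expert_keys, temperature=1.0):
    """Hypothetical gating: score each pretrained expert's key vector
    against the VLM-produced skill embedding and return a softmax
    distribution used to pick (or blend) experts."""
    logits = expert_keys @ skill_embedding / temperature
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy example: a library of 3 experts with 4-dim key vectors.
rng = np.random.default_rng(0)
expert_keys = rng.normal(size=(3, 4))      # one key per pretrained expert
skill_embedding = rng.normal(size=4)       # stand-in for the VLM output
weights = select_expert(skill_embedding, expert_keys)
chosen = int(np.argmax(weights))           # hard selection of one expert
```

In practice a mechanism like this could either commit to the top-scoring expert or mix expert actions by the soft weights, which is where the motion-prior fusion in the abstract would enter.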