🤖 AI Summary
This work addresses the limitations of single-policy approaches in humanoid robot loco-manipulation tasks, which often result in unnatural motions, poor stability, and weak compositional generalization. To overcome these challenges, the authors propose a hierarchical world model framework that trains multiple expert policies—each infused with human motion priors—via imitation-constrained reinforcement learning. A vision-language model (VLM)-driven routing mechanism enables semantic-guided dynamic composition of these experts. This approach achieves, for the first time, semantic-aware adaptive policy scheduling, significantly enhancing motion naturalness, stability, and cross-task compositional generalization while effectively mitigating gradient interference and motion-mode conflicts among diverse skills.
📝 Abstract
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion-pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes the complex control problem into a set of Specialized Expert Policies (SEPs). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
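The expert-composition idea described above can be illustrated with a minimal sketch: a router maps a task-semantics embedding to mixture weights over several expert policies, and the final command is a convex combination of the experts' actions. All class names, shapes, and the linear policies below are illustrative assumptions, not the paper's actual implementation; in the paper, the routing signal is supervised by a VLM, which is stood in for here by a placeholder embedding vector.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

class ExpertPolicy:
    """Stand-in for one specialized expert (SEP): maps a proprioceptive
    state vector to a joint-space action (linear here for brevity)."""
    def __init__(self, state_dim, action_dim, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(action_dim, state_dim))

    def act(self, state):
        return self.W @ state

class IntelligentRouter:
    """Sketch of the routing mechanism (IRM): maps a task-semantics
    embedding (assumed to come from a VLM) to mixture weights over the
    experts, then blends their actions into one whole-body command."""
    def __init__(self, experts, embed_dim, seed=0):
        self.experts = experts
        rng = np.random.default_rng(seed)
        self.G = rng.normal(scale=0.1, size=(len(experts), embed_dim))

    def act(self, state, task_embedding):
        weights = softmax(self.G @ task_embedding)        # one weight per expert
        actions = np.stack([e.act(state) for e in self.experts])
        return weights @ actions, weights                 # convex combination

# Illustrative dimensions (assumptions, not from the paper).
state_dim, action_dim, embed_dim = 32, 12, 16
experts = [ExpertPolicy(state_dim, action_dim, seed=i) for i in range(3)]
router = IntelligentRouter(experts, embed_dim)

state = np.ones(state_dim)
task_embedding = np.ones(embed_dim)   # placeholder for a VLM-derived embedding
action, weights = router.act(state, task_embedding)
```

Because the blend is a convex combination, routing degenerates gracefully: a near-one-hot weight vector recovers a single expert, while soft weights interpolate between motion modes during multi-stage tasks.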