Lifting Embodied World Models for Planning and Control

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
High-dimensional action spaces make planning and control with world models of embodied agents expensive. This work proposes a "lifted world model": a frozen visual world model composed with a lightweight policy network that maps interpretable, low-dimensional high-level actions, such as 2D waypoints, to sequences of low-level joint actions. Planning in this reduced action space with search methods such as the Cross-Entropy Method (CEM) is more compute-efficient than searching in joint space. Experiments show that, compared with planning directly in joint space, the method achieves 3.8× lower mean joint error to the goal pose and generalizes to environments unseen by the policy.
📝 Abstract
World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space ($3.8\times$ lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
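The composition described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the dimensions (`N_JOINTS`, `N_WAYPOINT`, `HORIZON`), the additive `world_model`, the linear `policy` with fixed random weights, and the `cem_plan` hyperparameters are all hypothetical stand-ins. The only point is the structure: a policy expands one high-level action into a sequence of low-level actions, the frozen world model rolls them out, and CEM searches over the small high-level space.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N_JOINTS = 24    # low-level joint-action dimension
N_WAYPOINT = 8   # 4 waypoints x 2D coordinates
HORIZON = 5      # low-level steps generated per high-level action

# Stand-in for trained policy weights (assumption: a fixed random linear map).
_rng = np.random.default_rng(0)
POLICY_W = _rng.normal(size=(N_JOINTS, N_WAYPOINT)) / np.sqrt(N_WAYPOINT)


def world_model(state, joint_action):
    """Toy frozen world model: the next state is the state plus the action."""
    return state + joint_action


def policy(state, waypoints):
    """Toy policy: expand one high-level action into HORIZON joint actions.

    The toy ignores `state`; a learned policy would condition on it.
    """
    step = POLICY_W @ waypoints / HORIZON
    return np.tile(step, (HORIZON, 1))


def lifted_world_model(state, waypoints):
    """Compose policy and world model: one high-level action in,
    a sequence of HORIZON predicted states out."""
    states = []
    for joint_action in policy(state, waypoints):
        state = world_model(state, joint_action)
        states.append(state)
    return np.stack(states)


def cem_plan(state, goal, dim, iters=20, pop=64, n_elite=8, seed=0):
    """Cross-Entropy Method over the high-level action space."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = mean + std * rng.normal(size=(pop, dim))
        # Cost: distance between the final predicted state and the goal.
        costs = np.array([
            np.linalg.norm(lifted_world_model(state, s)[-1] - goal)
            for s in samples
        ])
        elite = samples[np.argsort(costs)[:n_elite]]
        # Refit the sampling distribution to the lowest-cost samples.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```

In this sketch CEM samples in 8 dimensions rather than `HORIZON * N_JOINTS = 120`, which is the kind of saving the abstract attributes to planning in the lifted action space. A call such as `cem_plan(np.zeros(N_JOINTS), goal, N_WAYPOINT)` returns a single high-level action whose rollout through `lifted_world_model` approaches `goal`.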
Problem

Research questions and friction points this paper is trying to address.

world models
embodied agents
high-dimensional action spaces
planning
control
Innovation

Methods, ideas, or system contributions that make the work stand out.

lifted world models
embodied agents
hierarchical action spaces
visual waypoints
planning efficiency
🔎 Similar Papers
2024-07-09, IEEE/ASME Transactions on Mechatronics, Citations: 94