Lifting Embodied World Models for Planning and Control

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

High-dimensional action spaces make world models of embodied agents hard to control and expensive to plan with. This work proposes a "lifted world model": a frozen visual world model composed with a lightweight policy network that maps interpretable, low-dimensional high-level actions (such as 2D waypoints) into sequences of low-level joint actions. Planning in this reduced action space with search-based methods such as the Cross-Entropy Method (CEM) lowers the computational cost of the search. Compared with searching directly in joint space, the method achieves 3.8× lower mean joint error to the goal pose while remaining more compute-efficient and generalizing to environments unseen by the policy.

📝 Abstract

World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with, as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space ($3.8\times$ lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
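The composition described in the abstract can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: the "policy" is a fixed random linear map, the "world model" simply integrates joint actions into a joint-angle state instead of predicting images, and the dimensions (`N_JOINTS`, `HORIZON`, `N_WAYPOINTS`) and CEM hyperparameters are arbitrary assumptions. The point is only that CEM searches over 8 waypoint coordinates, while the policy expands each candidate into a full sequence of joint actions.

```python
import numpy as np

# All names, shapes, and dynamics below are hypothetical stand-ins.
N_JOINTS = 24     # assumed low-level action dimension per step
HORIZON = 8       # assumed number of low-level steps per high-level action
N_WAYPOINTS = 4   # e.g. pelvis, head, and two hands
rng = np.random.default_rng(0)

# Fixed random linear map standing in for the trained lightweight policy.
W_POLICY = rng.normal(size=(N_WAYPOINTS * 2, N_JOINTS)) / np.sqrt(N_WAYPOINTS * 2)

def policy(waypoints):
    """Map 2D waypoints (N_WAYPOINTS, 2) to joint actions (HORIZON, N_JOINTS)."""
    per_step = waypoints.reshape(-1) @ W_POLICY / HORIZON
    return np.tile(per_step, (HORIZON, 1))

def world_model(state, joint_actions):
    """Toy frozen world model: integrate joint actions into the joint state."""
    return state + joint_actions.sum(axis=0)

def lifted_world_model(state, waypoints):
    """Compose policy and world model: one high-level action in, outcome out."""
    return world_model(state, policy(waypoints))

def joint_error(pred, goal):
    """Mean absolute joint error between a predicted and a goal pose."""
    return float(np.abs(pred - goal).mean())

def cem_plan(state, goal, iters=20, pop=64, n_elite=8):
    """Cross-Entropy Method over the low-dimensional waypoint space."""
    mean = np.zeros((N_WAYPOINTS, 2))
    std = np.ones((N_WAYPOINTS, 2))
    for _ in range(iters):
        # Sample candidate waypoint sets from the current Gaussian.
        samples = mean + std * rng.normal(size=(pop, N_WAYPOINTS, 2))
        costs = np.array(
            [joint_error(lifted_world_model(state, w), goal) for w in samples]
        )
        # Refit the Gaussian to the lowest-cost candidates.
        elite = samples[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Usage: build a reachable goal pose, then plan toward it from a zero state.
state = np.zeros(N_JOINTS)
target_waypoints = np.array([[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1], [0.4, 0.4]])
goal = lifted_world_model(state, target_waypoints)
plan = cem_plan(state, goal)
initial_err = joint_error(lifted_world_model(state, np.zeros((N_WAYPOINTS, 2))), goal)
final_err = joint_error(lifted_world_model(state, plan), goal)
print(f"joint error before planning: {initial_err:.4f}, after: {final_err:.4f}")
```

In this sketch the search space has `N_WAYPOINTS * 2 = 8` dimensions, whereas searching directly over joint actions would involve `HORIZON * N_JOINTS = 192`, which is the scaling issue the abstract attributes to CEM.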

Problem

Research questions and friction points this paper is trying to address.

world models

embodied agents

high-dimensional action spaces

planning

control

Innovation

Methods, ideas, or system contributions that make the work stand out.

lifted world models

embodied agents

hierarchical action spaces