Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Embodied world models face critical bottlenecks, including the scarcity of embodied interaction data, the difficulty of aligning high-dimensional multimodal representations, and the challenge of long-horizon video generation, all of which hinder their evolution toward general embodied intelligence. To address these, we propose Primitive Embodied World Models (PEWM), which (1) restrict video generation to short, fixed horizons, enabling fine-grained alignment between linguistic concepts and visual action representations; (2) integrate a modular vision-language planner with a Start-Goal heatmap Guidance (SGG) mechanism to jointly model physical interaction and high-level semantic reasoning; and (3) incorporate spatiotemporal visual priors to enhance generalization. PEWM substantially reduces data requirements and inference latency while achieving high-precision long-horizon planning, compositional generalization over multi-step policies, and robust closed-loop control in complex tasks. This work establishes a scalable, interpretable paradigm for general embodied intelligence.
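
A minimal sketch of how one PEWM step could be wired together, assuming the plan/generate/execute structure described in the summary. The paper does not publish an API, so every name here (`Primitive`, `generate_short_clip`, `clip_to_actions`, `robot`) is a hypothetical stand-in, not the authors' implementation; only the control flow reflects the summary.

```python
from dataclasses import dataclass

# Hypothetical stand-ins: no public API exists for this paper, so the
# names below are illustrative. Only the control flow mirrors the summary.

@dataclass
class Primitive:
    instruction: str            # e.g. "reach toward the red block"
    start_xy: tuple[int, int]   # pixel where the motion begins
    goal_xy: tuple[int, int]    # pixel where the motion should end

def generate_short_clip(frame, prim: Primitive, horizon: int = 16):
    """Primitive world model: predict a fixed short-horizon video clip,
    conditioned on the instruction and on start/goal heatmaps (SGG)."""
    raise NotImplementedError("stand-in for the video generation model")

def clip_to_actions(clip):
    """Decode executable robot actions from the generated clip."""
    raise NotImplementedError("stand-in for the action decoder")

def execute_primitive(prim: Primitive, robot) -> None:
    """One PEWM step: generate a short clip from the current observation,
    then execute the decoded actions on the real robot."""
    frame = robot.observe()
    clip = generate_short_clip(frame, prim)
    for action in clip_to_actions(clip):
        robot.step(action)
```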

📝 Abstract
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, collection difficulty, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. A simple observation motivates our approach: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach (1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, (2) reduces learning complexity, (3) improves data efficiency in embodied data collection, and (4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors of video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
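
The abstract names a Start-Goal heatmap Guidance (SGG) mechanism but does not spell out its construction. A common way to implement this kind of spatial guidance is to render Gaussian heatmaps at the start and goal pixels and concatenate them as extra conditioning channels for the video model; the sketch below assumes that design, and the function names, `sigma`, and channel layout are my own choices, not the paper's.

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, center_xy, sigma: float = 8.0):
    """Render an h x w heatmap with a Gaussian peak at center_xy (pixels)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma**2))

def sgg_channels(h: int, w: int, start_xy, goal_xy, sigma: float = 8.0):
    """Stack start- and goal-position heatmaps as two conditioning channels,
    to be concatenated with the video model's visual input."""
    return np.stack([
        gaussian_heatmap(h, w, start_xy, sigma),
        gaussian_heatmap(h, w, goal_xy, sigma),
    ])  # shape: (2, h, w)
```

For a 256x256 frame, `sgg_channels(256, 256, (60, 180), (200, 90))` yields a `(2, 256, 256)` array that can be concatenated with the RGB frames along the channel axis.
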
Problem

Research questions and friction points this paper is trying to address.

Addressing reliance on large-scale embodied interaction data for world models
Overcoming the scarcity, collection difficulty, and high dimensionality of embodied data
Improving alignment between language concepts and robotic actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Primitive Embodied World Models paradigm
Short-horizon video generation for fine-grained language-action alignment
Modular VLM planner with Start-Goal heatmap Guidance (SGG); a sketch of chaining primitives into long-horizon rollouts follows below
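
The compositional generalization claimed above follows from the fixed-horizon design: primitive clips can be chained, with the robot's latest real observation seeding each next primitive. The sketch below illustrates that chaining under the same caveat as the earlier snippets; `vlm_planner`, `world_model`, and their methods are hypothetical, not the paper's interfaces.

```python
def run_long_horizon_task(task: str, vlm_planner, world_model, robot):
    """Compose primitive-level policies over an extended task. The VLM
    planner decomposes the task into primitives; after each primitive,
    the next one is seeded from the robot's latest *real* observation,
    which keeps the loop closed and limits error compounding across clips."""
    frame = robot.observe()
    for prim in vlm_planner.plan(task, frame):    # semantic decomposition
        clip = world_model.generate(frame, prim)  # fixed short-horizon clip
        for action in world_model.decode_actions(clip):
            robot.step(action)
        frame = robot.observe()                   # replan from reality
    return frame
```
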
🔎 Similar Papers
No similar papers found.