Any4D: Open-Prompt 4D Generation from Natural Language and Images

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Embodied world models are hindered by the scarcity, high dimensionality, and prohibitive acquisition cost of embodied interaction data, which leads to coarse-grained language-action alignment and unstable long-horizon video generation and blocks breakthroughs akin to those achieved by large language models such as GPT. To address this, we propose Primitive Embodied World Models (PEWM): (1) video generation is constrained to short temporal horizons, enabling fine-grained alignment between linguistic concepts and visual representations of robot primitive actions; (2) a modular vision-language planner coupled with start-goal heatmap guidance supports closed-loop control and compositional generalization over primitive-level policies; and (3) spatiotemporal priors from video diffusion models are combined with the semantic understanding of vision-language models, yielding an extensible and interpretable embodied intelligence framework. Experiments demonstrate that PEWM substantially reduces data dependency and improves stability in long-horizon generation, establishing a new paradigm on the path toward a "GPT moment" for embodied AI.
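To make the closed loop described above concrete, here is a minimal Python sketch of the control flow: a planner decomposes the instruction into primitives, and a short-horizon video model predicts one clip per primitive, replanning from the latest observation. All interface names (`planner.plan`, `world_model.rollout`, `executor.track`, `executor.succeeded`) are hypothetical stand-ins, not the paper's API; only the fixed short horizon and per-primitive replanning come from the summary.

```python
# Hypothetical interfaces illustrating PEWM's closed loop. Names are
# illustrative only; they are not the authors' actual classes or methods.

def run_task(instruction, obs, planner, world_model, executor):
    """Execute a long-horizon task as a sequence of short primitives."""
    for primitive in planner.plan(instruction, obs):  # e.g. "reach", "grasp"
        # The world model generates only a fixed, short clip per primitive,
        # keeping language-action alignment fine-grained and rollout stable.
        clip = world_model.rollout(obs, primitive)
        obs = executor.track(clip)  # follow the predicted frames on the robot
        if not executor.succeeded(obs, primitive):
            # Closed-loop recovery: replan from the current observation.
            return run_task(instruction, obs, planner, world_model, executor)
    return obs
```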

📝 Abstract
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. There is a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restricts video generation to fixed, shorter horizons. Our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors of video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
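As one way to picture the Start-Goal heatmap Guidance (SGG) conditioning, the sketch below builds two single-channel maps with Gaussian bumps at a primitive's start and goal image positions. The Gaussian encoding, resolution, and sigma are assumptions for illustration; the paper's exact construction may differ.

```python
import numpy as np

def keypoint_heatmap(center_uv, shape=(256, 256), sigma=8.0):
    """H x W map with a Gaussian peak at pixel (u, v) = (col, row)."""
    u, v = center_uv
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

def sgg_condition(start_uv, goal_uv, shape=(256, 256)):
    """Stack start and goal heatmaps into a 2 x H x W conditioning array."""
    return np.stack([keypoint_heatmap(start_uv, shape),
                     keypoint_heatmap(goal_uv, shape)])

# Example: a primitive that moves an object from left to right in frame.
cond = sgg_condition(start_uv=(60, 128), goal_uv=(200, 140))
print(cond.shape)  # (2, 256, 256)
```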
Problem

Research questions and friction points this paper is trying to address.

Addresses reliance on scarce embodied interaction data for world models
Overcomes instability and misalignment in long-horizon video generation
Bridges gap between physical interaction and high-level reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Primitive Embodied World Models restrict video generation to fixed short horizons
Modular Vision-Language Model planner enables flexible closed-loop control
Start-Goal heatmap Guidance mechanism supports compositional generalization