Vid2World: Crafting Video Diffusion Models to Interactive World Models

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing world models suffer from low-fidelity predictions, weak action controllability, and strong domain dependence in complex environments. This paper proposes a causal reformulation framework that transforms pretrained video diffusion models into interactive world models. By redefining the diffusion objective, adapting the network architecture, and introducing a causal action guidance mechanism, the method enables high-fidelity, autoregressive dynamics modeling conditioned on explicit actions. It combines causal modeling, action-conditional generation, and cross-domain transfer to simulation environments, achieving significant improvements in prediction accuracy and interaction quality on robot manipulation and game simulation benchmarks. The core contribution is the first causal adaptation of video diffusion models into a general-purpose, action-controllable, and cross-domain transferable world model, establishing a new paradigm for high-fidelity dynamic environment modeling in embodied intelligence.
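
As a rough illustration of the causal reformulation described above, the sketch below shows how a bidirectional temporal attention layer can be made causal with a lower-triangular mask, so that frame t attends only to frames ≤ t and autoregressive generation becomes possible. The function names and tensor shapes are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def causal_temporal_mask(num_frames: int) -> torch.Tensor:
    # Lower-triangular boolean mask: True entries may be attended to,
    # so frame t only sees frames 0..t (no access to future frames).
    return torch.tril(torch.ones(num_frames, num_frames, dtype=torch.bool))

def masked_temporal_attention(q, k, v):
    # q, k, v: (batch, frames, dim) temporal tokens at one spatial location.
    _, t, d = q.shape
    scores = q @ k.transpose(-1, -2) / d ** 0.5               # (batch, t, t)
    mask = causal_temporal_mask(t).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))         # hide future frames
    return torch.softmax(scores, dim=-1) @ v                  # (batch, t, dim)
```

A pretrained video model's temporal layers are typically bidirectional; swapping in a mask like this is one standard way to make them causal, which is the general idea the summary refers to as "causal reformulation".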

📝 Abstract
World models, which predict transitions based on historical observation and action sequences, have shown great promise in improving data efficiency for sequential decision making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their applicability in complex environments. In contrast, video diffusion models trained on large, internet-scale datasets have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World performs causalization of a pre-trained video diffusion model, crafting its architecture and training objective to enable autoregressive generation. Furthermore, it introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model. Extensive experiments in robot manipulation and game simulation domains show that our method offers a scalable and effective approach for repurposing highly capable video diffusion models into interactive world models.
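
The abstract does not spell out the causal action guidance mechanism. A common classifier-free-guidance-style recipe, sketched below under that assumption, drops each timestep's action independently during training and then mixes conditional and unconditional noise predictions at sampling time; all identifiers, the drop rate, and the guidance weight `w` here are hypothetical.

```python
import torch

def drop_actions_independently(actions, null_action, p_drop=0.1):
    # actions: (batch, frames, action_dim). Each timestep's action is
    # independently replaced by a null token with probability p_drop, so the
    # model learns both action-conditional and unconditional predictions.
    keep = torch.rand(actions.shape[:2], device=actions.device) > p_drop
    return torch.where(keep.unsqueeze(-1), actions, null_action)

@torch.no_grad()
def action_guided_eps(model, x_t, t, actions, null_action, w=1.5):
    # Classifier-free-style guidance over actions at sampling time:
    # eps = eps_uncond + w * (eps_cond - eps_uncond).
    eps_cond = model(x_t, t, actions)
    eps_uncond = model(x_t, t, null_action.expand_as(actions))
    return eps_uncond + w * (eps_cond - eps_uncond)
```
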
Problem

Research questions and friction points this paper is trying to address.

Transforming video diffusion models into interactive world models
Enhancing action controllability in autoregressive generation
Improving fidelity and scalability in complex environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causalization of pre-trained video diffusion models
Causal action guidance for controllability
Autoregressive generation for world modeling (see the rollout sketch after this list)
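
A minimal sketch of what an autoregressive world-model rollout could look like once the model is causalized: each new frame is sampled conditioned on all previous frames and the current action. The `sample_frame` helper stands in for the full diffusion sampling loop, which is not shown; all identifiers here are hypothetical.

```python
import torch

@torch.no_grad()
def rollout(world_model, history_frames, actions, sample_frame):
    # history_frames: list of (batch, C, H, W) context frames.
    # actions: iterable of per-step action tensors.
    # sample_frame: callable that runs the diffusion sampler for one frame,
    #               conditioned on the frame history and the current action.
    frames = list(history_frames)
    for action in actions:
        context = torch.stack(frames, dim=1)      # (batch, t, C, H, W)
        frames.append(sample_frame(world_model, context, action))
    return torch.stack(frames, dim=1)             # full predicted video
```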