VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

📅 2025-12-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Generative video models have fundamental limitations as world models: physical inconsistency, lack of interpretability, and absence of interactivity. To address these limitations, we propose a queryable, interactive, structured world model built around a vision-language model (VLM) that acts as the core agent. The agent orchestrates visual tools and physics simulators (e.g., rigid-body and fluid engines) to infer implicit dynamics from a static scene and run an adaptive simulation. This "VLM–Tool–Simulator" paradigm combines a structured scene representation with inferred dynamics, which supports structured scene understanding and semantic-level querying. Evaluated across diverse dynamic scenarios, the approach produces physically plausible and interpretable predictions of future states, and it aims to improve the controllability, cross-scene generalization, and causal traceability of world models.

📝 Abstract

Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework in which a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and then chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.
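
The abstraction-then-simulation flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not VDAWorld's actual interface: the names `SceneObject`, `SceneRepresentation`, `keyword_detector`, `abstract_scene`, and `choose_simulator` are hypothetical, the "vision tool" reads the caption instead of the image, and a simple rule stands in for the VLM's decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SceneObject:
    name: str
    material: str  # hypothetical label: "rigid" or "fluid"


@dataclass
class SceneRepresentation:
    dimensionality: str  # "2D" or "3D"
    objects: List[SceneObject] = field(default_factory=list)


def keyword_detector(caption: str) -> List[SceneObject]:
    """Toy stand-in for a vision tool: reads objects off the caption text."""
    vocabulary = {"ball": "rigid", "box": "rigid", "water": "fluid"}
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    return [SceneObject(name, mat) for name, mat in vocabulary.items() if name in words]


def abstract_scene(
    caption: str,
    tools: List[Callable[[str], List[SceneObject]]],
    dimensionality: str = "2D",
) -> SceneRepresentation:
    """Run the selected tools and merge their outputs into one representation."""
    scene = SceneRepresentation(dimensionality=dimensionality)
    for tool in tools:
        scene.objects.extend(tool(caption))
    return scene


def choose_simulator(scene: SceneRepresentation) -> str:
    """Rule-based stand-in for the VLM's choice of a compatible simulator."""
    materials = {obj.material for obj in scene.objects}
    return "fluid" if "fluid" in materials else "rigid_body"


rigid_scene = abstract_scene("A ball rolls off a table.", tools=[keyword_detector])
fluid_scene = abstract_scene("Water pours onto a box.", tools=[keyword_detector])
print(choose_simulator(rigid_scene), choose_simulator(fluid_scene))
```

In the framework as described, both the tool selection and the simulator choice are made by the VLM from the image-caption pair; the fixed rule here only shows where those decisions sit in the flow.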

Problem

Research questions and friction points this paper is trying to address.

Overcoming generative video models' physical/logical violations and opacity

Creating tractable abstract representations from visual data for simulation

Enabling interactive queryable world models through intelligent abstraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-directed abstraction for tractable scene representation

Adaptive physics simulator selection based on scene content

Latent dynamics inference from static scenes for future prediction
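
To make the last point concrete: a static image does not show velocities, so they have to be inferred before a simulator can roll the scene forward. The sketch below assumes such an inferred initial state is already available and advances a single rigid body with semi-implicit Euler integration. The `BodyState` type, the `rollout` function, and all parameter values (time step, gravity, restitution) are hypothetical choices for illustration, not the paper's simulator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BodyState:
    x: float   # horizontal position (m)
    y: float   # height above the ground (m)
    vx: float  # horizontal velocity (m/s), not observable in a still image
    vy: float  # vertical velocity (m/s), not observable in a still image


def rollout(
    state: BodyState,
    steps: int,
    dt: float = 0.01,
    gravity: float = -9.81,
    ground_y: float = 0.0,
    restitution: float = 0.5,
) -> List[BodyState]:
    """Predict future states of one body with semi-implicit Euler steps."""
    trajectory = [state]
    for _ in range(steps):
        vy = state.vy + gravity * dt  # update velocity first
        x = state.x + state.vx * dt
        y = state.y + vy * dt         # then position, using the new velocity
        if y < ground_y:              # crude ground contact: clamp and bounce
            y = ground_y
            vy = -vy * restitution
        state = BodyState(x, y, state.vx, vy)
        trajectory.append(state)
    return trajectory


# Assumed inferred state: a ball 1 m above the ground moving right at 1 m/s.
inferred = BodyState(x=0.0, y=1.0, vx=1.0, vy=0.0)
future = rollout(inferred, steps=100)
print(len(future), round(future[-1].x, 3))
```

Changing the assumed initial velocity changes the whole predicted trajectory, which is why the inference step matters as much as the simulator itself.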
