🤖 AI Summary
Generative video models suffer from fundamental limitations as world models, including physical inconsistency, lack of interpretability, and absence of interactivity. To address these, we propose a queryable and interactive structured world model centered on a vision-language model (VLM) acting as the core intelligent agent. This agent orchestrates multimodal visual tools and multi-physics simulators (e.g., rigid-body and fluid engines) to automatically infer implicit dynamics from static scenes and perform adaptive simulation. We introduce a novel “VLM–Tool–Simulator” collaborative paradigm that integrates scene graph representation with implicit dynamic modeling, enabling structured scene understanding and semantic-level querying. Evaluated across diverse dynamic scenarios, our approach generates high-fidelity, physically plausible, and interpretable predictions of future states. It significantly improves controllability, cross-scene generalization, and causal traceability of world models.
📝 Abstract
Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.
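The control flow described above (abstract the input into a scene representation, then dispatch to a compatible simulator) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than VDAWorld's actual interface: `stub_agent` replaces the VLM with a keyword rule over the caption alone, `SceneRepresentation` is an assumed data structure, and the two simulator functions return placeholder strings instead of running physics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SceneRepresentation:
    """Hypothetical abstract scene: not the paper's actual data structure."""
    kind: str            # "2D" or "3D"
    objects: List[str]   # tokens standing in for grounded objects
    dynamics: str        # which simulator family fits, e.g. "rigid_body" or "fluid"


def stub_agent(caption: str) -> SceneRepresentation:
    """Stand-in for the VLM agent: a keyword rule instead of a model call."""
    fluid_words = ("water", "pour", "smoke", "liquid")
    text = caption.lower()
    dynamics = "fluid" if any(word in text for word in fluid_words) else "rigid_body"
    return SceneRepresentation(kind="2D", objects=text.split(), dynamics=dynamics)


def simulate_rigid_body(scene: SceneRepresentation, steps: int) -> List[str]:
    """Placeholder simulator: returns labels, performs no physics."""
    return [f"rigid_body step {t}" for t in range(steps)]


def simulate_fluid(scene: SceneRepresentation, steps: int) -> List[str]:
    """Placeholder simulator: returns labels, performs no physics."""
    return [f"fluid step {t}" for t in range(steps)]


# Registry mapping the agent's chosen dynamics type to a compatible simulator.
SIMULATORS: Dict[str, Callable[[SceneRepresentation, int], List[str]]] = {
    "rigid_body": simulate_rigid_body,
    "fluid": simulate_fluid,
}


def predict_future(caption: str, steps: int = 3) -> List[str]:
    """Abstract the input, pick a simulator, and roll it forward."""
    scene = stub_agent(caption)
    simulator = SIMULATORS[scene.dynamics]
    return simulator(scene, steps)


if __name__ == "__main__":
    print(predict_future("A glass of water tipping over", steps=2))
    print(predict_future("A ball rolling off a table", steps=2))
```

The registry keeps the agent's decision (which dynamics apply) separate from the simulators themselves, so under this assumed design a new engine could be added without changing the dispatch logic.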