🤖 AI Summary
Existing world models rely solely on observational data for prediction, rendering them incapable of answering counterfactual queries (e.g., "What would happen if object X were removed?") and thus limiting their applicability in causal reasoning tasks such as physical AI behavior evaluation. To address this, we propose CWMDT, the first framework integrating digital twins, large language models (LLMs), and video diffusion models. CWMDT employs structured textual representations to disentangle scene elements, enabling explicit interventions on specific object attributes and facilitating controllable, temporally coherent counterfactual video generation outside pixel space. This shift moves beyond observation-driven modeling, advancing world models from passive prediction to active causal simulation. Evaluated on two counterfactual video benchmarks, CWMDT achieves state-of-the-art performance, demonstrating both the effectiveness and scalability of digital twin representations for modeling complex causal interventions.
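To make the "structured textual representation" concrete, the sketch below shows what a text-based digital twin and an object-removal intervention might look like. The schema, field names, and the `remove_object` helper are hypothetical illustrations; the paper's actual representation and interfaces may differ.

```python
import copy
import json

# Hypothetical digital-twin schema: objects with attributes plus pairwise relations.
scene_twin = {
    "objects": [
        {"id": "ball_1", "category": "ball", "color": "red", "position": [0.2, 0.0, 1.5]},
        {"id": "ramp_1", "category": "ramp", "color": "gray", "position": [0.0, 0.0, 0.0]},
    ],
    "relations": [
        {"subject": "ball_1", "predicate": "resting_on", "object": "ramp_1"},
    ],
}

def remove_object(twin: dict, object_id: str) -> dict:
    """Counterfactual intervention: delete an object and every relation that mentions it."""
    edited = copy.deepcopy(twin)
    edited["objects"] = [o for o in edited["objects"] if o["id"] != object_id]
    edited["relations"] = [
        r for r in edited["relations"]
        if object_id not in (r["subject"], r["object"])
    ]
    return edited

# "What would happen if ball_1 were removed?" becomes an explicit, targeted edit
# on the twin rather than an operation in entangled pixel space.
counterfactual_twin = remove_object(scene_twin, "ball_1")
print(json.dumps(counterfactual_twin, indent=2))
```

Because the intervention is an edit to a symbolic scene description, it can target a single object or attribute without disturbing the rest of the scene, which is exactly what pixel-space representations make difficult.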
📄 Abstract
World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of this focus on forward simulation, current world models generate predictions based only on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object were removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations in which object properties and relationships cannot be selectively modified, a modeling choice that prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework that overcomes these limitations and turns standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes that explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model on the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that CWMDT achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for forward-simulation-based video world models.
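The abstract describes a three-stage pipeline (twin construction, LLM-based intervention propagation, diffusion-based rendering). The minimal end-to-end sketch below shows how these stages could compose; all function names (`build_digital_twin`, `llm_propagate_intervention`, `video_diffusion_generate`) are hypothetical stand-ins for the perception, LLM-reasoning, and conditional-generation components, whose real interfaces the abstract does not specify.

```python
from typing import List

# Stage stubs: each stands in for a model the abstract names but whose API is
# not given here. Replace with real perception, LLM, and diffusion components.

def build_digital_twin(frames: List[str]) -> str:
    """Stage 1 (hypothetical): parse observed frames into a structured-text
    digital twin that explicitly lists objects and their relationships."""
    return '{"objects": [...], "relations": [...]}'

def llm_propagate_intervention(twin_text: str, intervention: str) -> List[str]:
    """Stage 2 (hypothetical): prompt an LLM to apply the counterfactual edit
    and reason about how it propagates over time, yielding one edited twin
    per future time step."""
    prompt = (
        "Given this scene description:\n" + twin_text +
        "\nApply the intervention: " + intervention +
        "\nand describe the resulting scene at each future time step."
    )
    # A real implementation would send `prompt` to a language model; stubbed here.
    return ["<edited twin, t=0>", "<edited twin, t=1>", "<edited twin, t=2>"]

def video_diffusion_generate(twin_sequence: List[str]) -> str:
    """Stage 3 (hypothetical): condition a video diffusion model on the edited
    twin sequence to render the counterfactual video."""
    return "counterfactual_video.mp4"

# End-to-end counterfactual query over an observed clip.
observed_frames = ["frame_000.png", "frame_001.png"]
twin = build_digital_twin(observed_frames)
edited_sequence = llm_propagate_intervention(twin, "remove object ball_1")
video = video_diffusion_generate(edited_sequence)
print(video)
```

The key design point the sketch highlights is that the intervention and its temporal propagation happen entirely in the text representation, so the video diffusion model is used only as a renderer conditioned on already-edited scene descriptions.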