🤖 AI Summary
Existing world models struggle to capture the causal dynamics of interactions among objects, which limits their capacity for complex reasoning and control. This work proposes Causal-JEPA (C-JEPA), an object-centric world model that extends masked joint-embedding prediction from image patches to object-level representations. Because a masked object's state must be predicted from the states of the other objects, the masking acts as a latent intervention and introduces a causal inductive bias, so the model learns interaction-aware representations that support counterfactual reasoning. On visual question answering, C-JEPA improves counterfactual reasoning by about 20 percentage points (absolute) over the same architecture without object-level masking. On agent control tasks, it achieves performance comparable to patch-based world models while using only 1% of the latent input features those models require.
📝 Abstract
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
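To make the idea of object-level masking concrete, here is a minimal sketch in Python with NumPy. It is not taken from the C-JEPA codebase: the function names (`mask_objects`, `masked_prediction_loss`), the `(num_objects, dim)` slot layout, the zero mask token, and the mean-of-visible-objects stand-in predictor are all assumptions made for illustration. It shows only the general pattern the abstract describes: hide an entire object's latent state, predict it from the remaining objects, and score the prediction on the hidden objects alone.

```python
import numpy as np

def mask_objects(slots, mask_token, num_masked, rng):
    """Replace whole object slots with a mask token.

    slots: (num_objects, dim) array of object latents (hypothetical layout).
    Returns the masked copy and a boolean vector marking the hidden objects.
    """
    num_objects = slots.shape[0]
    hidden = np.zeros(num_objects, dtype=bool)
    hidden[rng.choice(num_objects, size=num_masked, replace=False)] = True
    # Masking is per object: every feature of a hidden object is replaced.
    masked = np.where(hidden[:, None], mask_token[None, :], slots)
    return masked, hidden

def masked_prediction_loss(predicted, target, hidden):
    """Mean squared error computed only over the hidden objects."""
    diff = predicted[hidden] - target[hidden]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 8))   # 4 objects, 8 latent features each (assumed sizes)
mask_token = np.zeros(8)          # assumed placeholder; a real model would likely learn this
masked, hidden = mask_objects(slots, mask_token, num_masked=1, rng=rng)

# Stand-in predictor (an assumption, not the paper's model): it sees only the
# visible objects and guesses their mean for every slot.
visible_mean = masked[~hidden].mean(axis=0)
predicted = np.tile(visible_mean, (slots.shape[0], 1))
loss = masked_prediction_loss(predicted, slots, hidden)
```

Because the loss is restricted to the hidden objects and the predictor never sees their features, the only usable signal is the state of the other objects, which is the sense in which such masking rules out shortcut solutions.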