CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning

πŸ“… 2026-01-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge that multimodal large language models (MLLMs) struggle with causal spatial reasoning in 3D scenes, often producing spatial hallucinations due to over-reliance on linguistic priors. To this end, the authors present CausalSpatial, the first systematically defined benchmark for evaluating a model's ability to predict the physical consequences of object interactions, spanning four task categories: collision, compatibility, occlusion, and trajectory. They also propose the Causal Object World model (COW), a framework that externalizes causal reasoning by generating videos of hypothetical dynamics, guiding models toward evidence-based visual reasoning rather than purely textual inference. Experimental results show that humans reach 84% accuracy on this benchmark while GPT-5 achieves only 54%; integrating COW significantly improves adherence to physical plausibility and mitigates spatial inconsistencies.
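The core idea of COW, roll a hypothetical motion forward explicitly and then answer the "what-if" question from that evidence rather than from text alone, can be illustrated with a toy stand-in. In the sketch below, a simple 2D box-overlap simulation plays the role of the generated video; all names (`Box`, `simulate_motion`, `will_collide`) are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned 2D box standing in for an object in the scene."""
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other: "Box") -> bool:
        # Standard axis-aligned bounding-box intersection test.
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)

def simulate_motion(moving: Box, dx: float, dy: float, steps: int = 10) -> list[Box]:
    """Roll the hypothetical motion forward and record each intermediate state.

    This is the analogue of COW's video generation: the counterfactual is
    made explicit as a sequence of "frames" instead of being inferred in text.
    """
    return [
        Box(moving.x + dx * t / steps, moving.y + dy * t / steps, moving.w, moving.h)
        for t in range(1, steps + 1)
    ]

def will_collide(moving: Box, static: Box, dx: float, dy: float) -> bool:
    """Answer the what-if question from the simulated evidence."""
    return any(frame.overlaps(static) for frame in simulate_motion(moving, dx, dy))

# Example: slide a unit box 3 units to the right toward a box at x = 2.
print(will_collide(Box(0, 0, 1, 1), Box(2, 0, 1, 1), dx=3, dy=0))  # True
# Moving the same box upward instead never crosses the other box's path.
print(will_collide(Box(0, 0, 1, 1), Box(2, 0, 1, 1), dx=0, dy=3))  # False
```

The point of the design is that the collision judgment is read off the simulated frames, not asserted directly, mirroring how COW grounds the model's answer in generated visual dynamics instead of linguistic priors.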

πŸ“ Abstract
Humans can look at a static scene and instantly predict what happens next -- will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what-if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: https://github.com/CausalSpatial/CausalSpatial
Problem

Research questions and friction points this paper is trying to address.

Causal Spatial Reasoning
Multimodal Large Language Models
Object-Centric Reasoning
Spatial Perception
Counterfactual Prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Spatial Reasoning
Object-Centric Reasoning
Multimodal Large Language Models
Visual Grounding
Causal Simulation
πŸ”Ž Similar Papers
No similar papers found.