🤖 AI Summary
Despite strong perceptual capabilities, state-of-the-art vision-language models (VLMs) exhibit limited competence in understanding physical dynamics and causal reasoning, particularly in counterfactual prediction and physics-based inference. Method: We systematically evaluate six SOTA VLMs across three physics simulation benchmarks (CLEVRER, Physion, and Physion++), using diagnostic subtests that decouple perceptual processing from physical reasoning via outcome-prediction and counterfactual-reasoning tasks. Contribution/Results: Empirical results reveal substantial performance variance across models; neither strong perception nor strong symbolic reasoning translates into improved physical prediction accuracy. Crucially, perceptual and physical reasoning capabilities show only weak correlation, exposing a fundamental "perception–causality" dissociation in current VLMs. This work introduces a diagnostic framework for causal understanding grounded in disentangled evaluation, providing both empirical evidence and methodological foundations for designing next-generation architectures that tightly integrate perception and causal reasoning.
📝 Abstract
Leading Vision-Language Models (VLMs) show strong results in visual perception and general reasoning, but their ability to understand and predict physical dynamics remains unclear. We benchmark six frontier VLMs on three physical simulation datasets (CLEVRER, Physion, and Physion++), where the evaluation tasks test whether a model can predict outcomes or reason counterfactually about alternative scenarios. To probe deeper, we design diagnostic subtests that isolate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). Intuitively, stronger diagnostic performance should support higher evaluation accuracy. Yet our analysis reveals weak correlations: models that excel at perception or physics reasoning do not consistently perform better on predictive or counterfactual evaluation. This counterintuitive gap exposes a central limitation of current VLMs: perceptual and physics skills remain fragmented and fail to combine into causal understanding, underscoring the need for architectures that bind perception and reasoning more tightly.