🤖 AI Summary
Despite strong perceptual capabilities, state-of-the-art vision-language models (VLMs) exhibit limited competence in understanding physical dynamics and causal reasoning, particularly in counterfactual prediction and physics-based inference. Method: We systematically evaluate six SOTA VLMs across three physics simulation benchmarks (CLEVRER, Physion, and Physion++), using diagnostic subtests that decouple perceptual processing from physical reasoning via outcome-prediction and counterfactual-reasoning tasks. Contribution/Results: Empirical results reveal substantial performance variance across models; neither strong perception nor strong symbolic reasoning translates into improved physical prediction accuracy. Crucially, perceptual and physical reasoning capabilities show only weak correlation, exposing a fundamental "perception–causality" dissociation in current VLMs. This work introduces a diagnostic framework for causal understanding grounded in disentangled evaluation, providing both empirical evidence and methodological foundations for designing next-generation architectures that tightly integrate perception and causal reasoning.
📝 Abstract
Leading Vision-Language Models (VLMs) show strong results in visual perception and general reasoning, but their ability to understand and predict physical dynamics remains unclear. We benchmark six frontier VLMs on three physical simulation datasets (CLEVRER, Physion, and Physion++), where the evaluation tasks test whether a model can predict outcomes or reason counterfactually about alternative scenarios. To probe deeper, we design diagnostic subtests that isolate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). Intuitively, stronger diagnostic performance should support higher evaluation accuracy. Yet our analysis reveals weak correlations: models that excel at perception or physics reasoning do not consistently perform better on predictive or counterfactual evaluation. This counterintuitive gap exposes a central limitation of current VLMs: perceptual and physics skills remain fragmented and fail to combine into causal understanding, underscoring the need for architectures that bind perception and reasoning more tightly.