InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

📅 2025-09-12
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Current large multimodal models (LMMs) show significant deficiencies in inductive physical reasoning: they struggle to generalize to unseen physical environments, limiting their deployment in safety-critical applications. To address this, the authors introduce InPhyRe, the first benchmark explicitly designed to evaluate vision-driven inductive physical reasoning. It pairs algorithmically generated synthetic collision videos with visual question-answering tasks to systematically assess how well models generalize when standard physical laws are violated. InPhyRe formally defines and quantifies this capability for the first time, and its results expose pervasive issues across mainstream models: linguistic prior bias, insufficient use of visual inputs, and conflation of parametric knowledge with genuine reasoning ability. Experiments spanning 13 state-of-the-art models show substantial performance degradation under non-standard physical conditions, highlighting reliability risks in real-world deployment.

📝 Abstract
Large multimodal models (LMMs) encode universal physical laws observed during training, such as momentum conservation, as parametric knowledge. This knowledge allows LMMs to answer physical reasoning queries, such as predicting the outcome of a potential collision event from visual input. However, since parametric knowledge covers only the physical laws seen during training, it is insufficient for reasoning when the inference scenario violates those laws. In contrast, humans can adapt their physical reasoning to unseen physical environments from a few visual examples. This ability, which we refer to as inductive physical reasoning, is indispensable if LMMs are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks evaluate only the parametric knowledge in LMMs, not inductive physical reasoning. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs on their ability to predict the outcome of collision events in algorithmically generated synthetic collision videos. By inspecting 13 LMMs, InPhyRe shows that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when demonstration samples violate universal physical laws, and (3) inductive physical reasoning in LMMs suffers from language bias and largely ignores the visual inputs, calling into question the trustworthiness of LMMs regarding visual inputs.
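To make the demonstration-plus-query structure of inductive physical reasoning concrete, here is a minimal, hypothetical sketch in Python. The function names (`elastic_outcome`, `violated_outcome`, `make_item`) and the specific counterfactual rule are illustrative assumptions, not the paper's actual data-generation code; the point is only that a benchmark item shows a few demonstrations governed by a non-standard rule and then asks for the outcome of a held-out query under that same rule.

```python
# Hypothetical sketch of an InPhyRe-style item: few-shot demonstrations
# under a counterfactual physical rule, plus a held-out query.
# All names here are illustrative, not the benchmark's actual API.

def elastic_outcome(m1, v1, m2, v2):
    """Post-collision velocities under the standard 1D elastic collision laws
    (momentum and kinetic energy conserved)."""
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return u1, u2

def violated_outcome(m1, v1, m2, v2):
    """A counterfactual rule: the incoming body stops and transfers its full
    velocity to the other, regardless of mass (violates momentum conservation)."""
    return 0.0, v1

def make_item(rule, demos, query):
    """Bundle demonstration examples (inputs with their outcomes under `rule`)
    together with a held-out query and its ground-truth answer."""
    return {
        "demonstrations": [(d, rule(*d)) for d in demos],
        "query": query,
        "answer": rule(*query),
    }

item = make_item(
    violated_outcome,
    demos=[(1.0, 2.0, 1.0, 0.0), (2.0, 3.0, 1.0, 0.0)],
    query=(1.0, 4.0, 3.0, 0.0),
)
# An inductive reasoner should infer the counterfactual rule from the
# demonstrations and answer (0.0, 4.0); a model that falls back on its
# parametric knowledge of elastic collisions would answer (-2.0, 2.0).
```

A model that ignores the demonstrations and answers from parametric knowledge alone fails exactly the kind of item this sketch illustrates, which is what the benchmark's second finding describes.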
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' inductive physical reasoning in unseen scenarios
Assessing LMMs' ability to adapt when demonstrations violate physical laws
Quantifying LMMs' language bias and underuse of visual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

InPhyRe, the first benchmark for inductive physical reasoning in LMMs
Algorithmically generated synthetic collision videos for evaluation
Measures LMMs' adaptation to unseen or violated physical laws