Vision Language Models Cannot Reason About Physical Transformation

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether current vision-language models (VLMs) genuinely understand the conservation principles underlying physical transformations. To address this, the work introduces ConservationBench, a benchmark of 23,040 paired questions spanning four physical attributes, designed to systematically evaluate VLMs' capacity to reason about physical conservation. Through controlled experiments on prompting strategies, temporal resolution, and sampling methods, the study finds that 112 mainstream VLMs perform near chance on conservation tasks, with visual input often degrading performance, indicating an overreliance on textual priors and a lack of robust visual representations of physical invariance. The work establishes a new benchmark and offers critical insight into the limits of VLMs' physical commonsense reasoning.

📝 Abstract
Understanding physical transformations is fundamental to reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench, which evaluates conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, the benchmark comprises 23,040 questions, on which we evaluate 112 VLMs. Results reveal systematic failure: performance remains near chance, with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with visual input. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.
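The paired conserving/non-conserving design matters because a model with a textual prior favoring invariance can score well on conserving items alone. One common way to score such pairs (a hypothetical sketch, not the paper's released code; all names here are illustrative) is to credit a model only when it answers both variants of a pair correctly:

```python
# Hypothetical sketch of paired-accuracy scoring for a benchmark with
# conserving / non-conserving item pairs. Function and variable names
# are illustrative assumptions, not ConservationBench's actual API.

def paired_accuracy(predictions, answers):
    """Fraction of pairs where BOTH variants are answered correctly.

    predictions, answers: lists of (conserving_answer, non_conserving_answer)
    tuples. Requiring both answers penalizes a model that always predicts
    "invariant" regardless of the visual evidence.
    """
    correct = sum(
        1 for pred, gold in zip(predictions, answers)
        if pred == gold  # tuple equality: both variants must match
    )
    return correct / len(answers)

# A model biased toward "conserved" gets every conserving variant right
# yet scores zero on the paired metric:
biased = [("conserved", "conserved")] * 4
gold = [("conserved", "not conserved")] * 4
print(paired_accuracy(biased, gold))  # 0.0
```

Under this scoring, per-variant accuracy of 50% from a constant-answer strategy collapses to 0% paired accuracy, which is why paired benchmarks expose invariance bias that single-question accuracy hides.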
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models
physical transformation
conservation
invariance
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Models
physical reasoning
conservation
benchmark evaluation
transformation invariance