From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models

📅 2025-08-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) show significant deficiencies in spatio-physical reasoning, largely because of biases from human-like priors and a lack of deep reasoning. This work presents a fine-grained diagnostic evaluation of physical reasoning across mainstream VLMs. Building on these findings, we apply supervised fine-tuning (SFT) followed by rule-based reinforcement learning (RL) to Qwen2.5-VL-7B. The resulting model improves substantially on spatio-physical reasoning and surpasses leading proprietary models. Our analysis also shows that generalization to unseen physics scenarios remains limited, which points to the need for new approaches.

📝 Abstract

Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Despite this success, the model's generalization to new physics scenarios remains limited, underscoring the pressing need for new approaches in spatio-physical reasoning.

Problem

Research questions and friction points this paper is trying to address.

Evaluating spatio-physical reasoning in vision language models

Identifying biases and lack of deep reasoning in VLMs

Improving model performance while addressing generalization limitations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised fine-tuning on Qwen2.5-VL-7B

Rule-based reinforcement learning implementation

Addressing human-like prior biases in reasoning
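
This summary does not state which rules the reinforcement learning stage uses. As a rough illustration only, rule-based rewards in this kind of training are often computed by checking the response format and comparing an extracted answer with the ground truth. The sketch below assumes that setup; the `<think>`/`<answer>` tag convention, the function name `rule_based_reward`, and the `format_weight` parameter are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical output convention (an assumption, not from the paper):
# reasoning inside <think>...</think>, final choice inside <answer>...</answer>.
_RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def rule_based_reward(response: str, gold: str, format_weight: float = 0.5) -> float:
    """Score a model response with fixed rules instead of a learned reward model.

    Returns 0.0 if the response breaks the expected format, format_weight if
    the format is valid but the answer is wrong, and format_weight + 1.0 if
    the format is valid and the answer matches the ground truth.
    """
    match = _RESPONSE_PATTERN.match(response)
    if match is None:
        return 0.0  # malformed output earns no reward
    predicted = match.group(2).strip().lower()
    is_correct = predicted == gold.strip().lower()
    return format_weight + (1.0 if is_correct else 0.0)
```

A reward of this shape is cheap to compute and hard to game with fluent but wrong text, since only an exact match on the final answer earns the accuracy term.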
