Probing Mechanical Reasoning in Large Vision Language Models

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the fundamental limitations of large vision-language models (VLMs) in mechanical reasoning—specifically across gear systems, fluid dynamics, levers, pulleys, and inertial motion—and identifies pathways for improvement. Method: Leveraging 155 cognitive psychology experiments, we systematically evaluate 26 state-of-the-art VLMs and introduce MechBench, the first comprehensive, multi-physics-domain benchmark for mechanical reasoning. Contribution/Results: All models exhibit substantial performance gaps relative to human baselines, especially in gear engagement modeling and dynamic fluid understanding; model size shows no significant correlation with performance, exposing intrinsic limitations of current attention mechanisms in mental simulation–based reasoning. Our work establishes the first controllable, cross-physics-domain evaluation framework; through cross-modal behavioral analysis, it pinpoints systematic reasoning failures; and it provides a novel assessment paradigm and concrete directions for advancing embodied AI and physical commonsense modeling.
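The gear-engagement failures highlighted above reduce to a parity rule that humans apply almost effortlessly: meshed spur gears alternate rotation direction along a chain. As an illustrative sketch (not code from the paper or MechBench), the rule for a linear chain of meshed gears:

```python
def gear_direction(n_gears: int, first: str = "cw") -> str:
    """Rotation direction of the last gear in a linear chain of meshed gears.

    Adjacent meshed gears rotate in opposite directions, so the last gear
    turns the same way as the first iff the chain length is odd.
    """
    if n_gears < 1:
        raise ValueError("chain must contain at least one gear")
    flip = {"cw": "ccw", "ccw": "cw"}
    return first if n_gears % 2 == 1 else flip[first]
```

The benchmark's gear items ask models to infer exactly this kind of propagated direction from an image; the study reports that performance on such items does not improve with model scale.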

📝 Abstract
Mechanical reasoning is a hallmark of human intelligence, defined by its ubiquitous yet irreplaceable role in human activities ranging from routine tasks to civil engineering. Embedding machines with mechanical reasoning is therefore an important step towards building human-level artificial intelligence. Here, we leveraged 155 cognitive experiments to test the understanding of system stability, gears and pulley systems, the leverage principle, inertia and motion, and fluid mechanics in 26 Vision Language Models (VLMs). Results indicate that VLMs consistently perform worse than humans across all domains, and demonstrate particular difficulty in reasoning about gear systems and fluid mechanics. Notably, their performance on these tasks does not improve as the number of parameters increases, suggesting that current attention-based architectures may fail to grasp certain underlying mechanisms required for mechanical reasoning, particularly those pertaining to mental simulations.
Problem

Research questions and friction points this paper is trying to address.

Tests mechanical reasoning in Vision Language Models.
Explores difficulties in gear systems and fluid mechanics.
Assesses impact of model parameters on performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Models
Mechanical reasoning tasks
Attention-based architecture limitations
Haoran Sun
Johns Hopkins University
Qingying Gao
Johns Hopkins University
Haiyun Lyu
University of North Carolina at Chapel Hill
Dezhi Luo
University of Michigan
cognitive science, philosophy, AI
Hokin Deng
Johns Hopkins University
cognition
Yijiang Li
Argonne National Laboratory