PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks overlook physical reasoning—the integration of domain knowledge, symbolic reasoning, and real-world physical constraints. Method: We introduce PhyX, the first large-scale visual benchmark for physical reasoning, covering six core physics domains, 25 subfields, and over 3,000 multimodal questions. It formally defines and quantifies physical reasoning capability and establishes a fine-grained, multi-paradigm evaluation framework. Our methodology integrates cross-domain physical knowledge modeling, case-driven attribution analysis, and a VLMEvalKit-compatible evaluation protocol. Contribution/Results: PhyX reveals fundamental limitations in state-of-the-art multimodal LLMs (e.g., GPT-4o), including rote memorization, formula dependency, and superficial visual matching—yielding accuracies of only 32.5%–45.8%, over 29 percentage points below human experts. The benchmark and an open-source, one-click evaluation toolkit are publicly released to advance standardized research in physics-aware AI.

📝 Abstract
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
Problem

Research questions and friction points this paper addresses.

Assessing models' capacity for physics-grounded reasoning in visual scenarios
Identifying limitations in current models' physical understanding and reasoning
Providing a comprehensive benchmark for evaluating physical reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal benchmark for physical reasoning
Comprehensive evaluation across six physics domains
Compatible one-click evaluation protocol for reproducibility
👥 Authors
Hui Shen
The University of Hong Kong
Taiqiang Wu
University of Hong Kong | Tsinghua University
Qi Han
Independent
Yunta Hsieh
University of Michigan
Jizhou Wang
University of Toronto; Illinois Institute of Technology
Yuyue Zhang
Independent
Yuxin Cheng
The University of Hong Kong
Zijian Hao
Independent
Yuansheng Ni
University of Waterloo
Xin Wang
The Ohio State University
Zhongwei Wan
The Ohio State University, PhD student
Kai Zhang
The Ohio State University
Wendong Xu
The University of Hong Kong
Jing Xiong
The University of Hong Kong
Ping Luo
National University of Defense Technology
Wenhu Chen
Assistant Professor at University of Waterloo
Chaofan Tao
The University of Hong Kong
Zhuoqing Mao
University of Michigan
Ngai Wong
The University of Hong Kong