🤖 AI Summary
Existing benchmarks overlook physical reasoning—the integration of domain knowledge, symbolic reasoning, and real-world physical constraints. Method: We introduce PhyX, the first large-scale visual benchmark for physical reasoning, covering six core physics domains, 25 sub-domains, and over 3,000 multimodal questions. It formally defines and quantifies physical reasoning capability and establishes a fine-grained, multi-paradigm evaluation framework. Our methodology integrates cross-domain physical knowledge modeling, case-driven attribution analysis, and a VLMEvalKit-compatible evaluation protocol. Contribution/Results: PhyX reveals fundamental limitations in state-of-the-art multimodal LLMs (e.g., GPT-4o), including rote memorization, formula dependency, and superficial visual matching—yielding accuracies of only 32.5%–45.8%, over 29 percentage points below human experts. The benchmark and an open-source, one-click evaluation toolkit are publicly released to advance standardized research in physics-aware AI.
📝 Abstract
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning: GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, performance gaps exceeding 29 percentage points compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement an evaluation protocol compatible with widely used toolkits such as VLMEvalKit, enabling one-click evaluation.