Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from three critical limitations in evaluating scientific reasoning—particularly in physics: coarse-grained subject coverage, neglect of reasoning process assessment, and English-centric benchmarks that fail to disentangle the role of vision. To address these gaps, we introduce Multi-Physics, the first fine-grained, Chinese-language benchmark for multidisciplinary physics reasoning. It comprises 1,412 image-text multiple-choice questions spanning 11 high-school physics subdomains and five difficulty levels. We propose a novel two-dimensional evaluation framework—“subject × difficulty”—and pioneer joint assessment of answer accuracy and chain-of-thought (CoT) completeness. Through input modality ablation studies, we quantitatively measure the contribution of visual information to scientific reasoning. We systematically evaluate 20 state-of-the-art MLLMs on this benchmark and publicly release all data, code, and analysis tools. Multi-Physics establishes a reproducible, attribution-aware evaluation paradigm for Chinese scientific reasoning.

📝 Abstract
While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. We therefore introduce Multi-Physics, a comprehensive benchmark for Chinese physics reasoning featuring 1,412 image-associated multiple-choice questions that span 11 high-school physics subjects across 5 difficulty levels. We employ a dual evaluation framework to assess 20 different MLLMs, analyzing both final-answer accuracy and the step-by-step integrity of their chain of thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing model performance before and after changing the input mode. Our work not only provides a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs. Our dataset and code are open-sourced: https://github.com/luozhongze/Multi-Physics.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal LLMs on Chinese physics reasoning
Addressing gaps in subject coverage and visual information
Assessing step-by-step reasoning and difficulty levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese physics benchmark with multimodal evaluation
Dual framework assessing accuracy and reasoning steps
Systematic visual impact analysis via input mode variation
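The dual evaluation idea above can be illustrated with a minimal sketch: score each model response on (a) final-answer accuracy and (b) chain-of-thought completeness, measured here as the fraction of reference reasoning steps the response covers. All function and field names are illustrative assumptions, not the paper's actual implementation, which may use more sophisticated step matching.

```python
def evaluate_response(response: str, answer: str, reference_steps: list[str]) -> dict:
    """Score one response: answer accuracy (0/1) and CoT step coverage in [0, 1].

    A step counts as covered if its text appears (case-insensitively) in the
    response; real evaluations would likely use semantic matching instead.
    """
    # Final-answer check: does the response end with the correct choice letter?
    correct = int(response.strip().upper().endswith(answer.upper()))
    # CoT completeness: fraction of reference steps mentioned in the response.
    covered = sum(1 for step in reference_steps if step.lower() in response.lower())
    coverage = covered / len(reference_steps) if reference_steps else 0.0
    return {"accuracy": correct, "cot_coverage": coverage}

# Hypothetical example item and model response:
steps = ["apply Newton's second law", "resolve forces", "solve for acceleration"]
resp = ("We apply Newton's second law, resolve forces along the incline, "
        "solve for acceleration: B")
print(evaluate_response(resp, "B", steps))
```

Reporting the two scores separately, rather than accuracy alone, is what lets the benchmark distinguish models that guess correctly from models that reason correctly.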
Zhongze Luo, The Chinese University of Hong Kong, Shenzhen
Zhenshuai Yin, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
Yongxin Guo, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
Zhichao Wang, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
Jionghao Zhu, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
Xiaoying Tang, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China