🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from three critical limitations in evaluating scientific reasoning—particularly in physics: coarse-grained subject coverage, neglect of reasoning process assessment, and English-centric benchmarks that fail to disentangle the role of vision. To address these gaps, we introduce Multi-Physics, the first fine-grained, Chinese-language benchmark for multidisciplinary physics reasoning. It comprises 1,412 image-text multiple-choice questions spanning 11 high-school physics subdomains and five difficulty levels. We propose a novel two-dimensional evaluation framework—“subject × difficulty”—and pioneer joint assessment of answer accuracy and chain-of-thought (CoT) completeness. Through input modality ablation studies, we quantitatively measure the contribution of visual information to scientific reasoning. We systematically evaluate 20 state-of-the-art MLLMs on this benchmark and publicly release all data, code, and analysis tools. Multi-Physics establishes a reproducible, attribution-aware evaluation paradigm for Chinese scientific reasoning.
📝 Abstract
While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce **Multi-Physics**, a comprehensive benchmark for Chinese physics reasoning that spans 5 difficulty levels and features 1,412 image-associated multiple-choice questions covering 11 high-school physics subjects. We employ a dual evaluation framework to assess 20 different MLLMs, analyzing both final-answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing model performance before and after changing the input modality. Our work not only provides a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs. Our dataset and code are open-sourced at https://github.com/luozhongze/Multi-Physics.
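The dual evaluation described above scores each question on two axes: final-answer accuracy and chain-of-thought completeness. A minimal sketch of such a scorer is shown below; the input fields (`pred_answer`, `gold_steps`, etc.) and the substring-based step-matching heuristic are illustrative assumptions, not the benchmark's actual implementation:

```python
# Hypothetical sketch of a dual evaluation metric: final-answer accuracy
# plus chain-of-thought (CoT) completeness. Field names and the simple
# substring step-matching heuristic are assumptions for illustration.

def evaluate(predictions):
    """predictions: list of dicts with keys
    'pred_answer', 'gold_answer', 'pred_steps', 'gold_steps'."""
    correct = 0
    cot_scores = []
    for p in predictions:
        # Axis 1 -- final-answer accuracy: exact match on the chosen option.
        if p["pred_answer"] == p["gold_answer"]:
            correct += 1
        # Axis 2 -- CoT completeness: fraction of reference solution steps
        # that appear (as substrings) in the model's reasoning trace.
        matched = sum(
            any(gold in step for step in p["pred_steps"])
            for gold in p["gold_steps"]
        )
        cot_scores.append(matched / len(p["gold_steps"]))
    n = len(predictions)
    return {
        "accuracy": correct / n,
        "cot_completeness": sum(cot_scores) / n,
    }
```

Reporting the two scores separately, rather than folding them into one number, is what lets the benchmark distinguish models that guess correct answers from models that also reason through the intermediate steps.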