🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from three critical limitations in evaluating scientific reasoning—particularly in physics: coarse-grained subject coverage, neglect of reasoning process assessment, and English-centric benchmarks that fail to disentangle the role of vision. To address these gaps, we introduce Multi-Physics, the first fine-grained, Chinese-language benchmark for multidisciplinary physics reasoning. It comprises 1,412 image-text multiple-choice questions spanning 11 high-school physics subdomains and five difficulty levels. We propose a novel two-dimensional evaluation framework—“subject × difficulty”—and pioneer joint assessment of answer accuracy and chain-of-thought (CoT) completeness. Through input modality ablation studies, we quantitatively measure the contribution of visual information to scientific reasoning. We systematically evaluate 20 state-of-the-art MLLMs on this benchmark and publicly release all data, code, and analysis tools. Multi-Physics establishes a reproducible, attribution-aware evaluation paradigm for Chinese scientific reasoning.
📝 Abstract
While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce **Multi-Physics**, a comprehensive benchmark for Chinese physics reasoning that spans 5 difficulty levels and features 1,412 image-associated multiple-choice questions covering 11 high-school physics subjects. We employ a dual evaluation framework to assess 20 different MLLMs, analyzing both final-answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing model performance before and after changing the input modality. Our work not only provides a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs. Our dataset and code are open-sourced at https://github.com/luozhongze/Multi-Physics.
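The dual evaluation described above scores each question on two axes: final-answer accuracy and chain-of-thought completeness. A minimal sketch of such a scorer is shown below; the input fields (`pred_answer`, `gold_steps`, etc.) and the substring-based step-matching heuristic are illustrative assumptions, not the benchmark's actual implementation:

```python
# Hypothetical sketch of a dual evaluation metric: final-answer accuracy
# plus chain-of-thought (CoT) completeness. Field names and the simple
# substring step-matching heuristic are assumptions for illustration.

def evaluate(predictions):
    """predictions: list of dicts with keys
    'pred_answer', 'gold_answer', 'pred_steps', 'gold_steps'."""
    correct = 0
    cot_scores = []
    for p in predictions:
        # Axis 1 -- final-answer accuracy: exact match on the chosen option.
        if p["pred_answer"] == p["gold_answer"]:
            correct += 1
        # Axis 2 -- CoT completeness: fraction of reference solution steps
        # that appear (as substrings) in the model's reasoning trace.
        matched = sum(
            any(gold in step for step in p["pred_steps"])
            for gold in p["gold_steps"]
        )
        cot_scores.append(matched / len(p["gold_steps"]))
    n = len(predictions)
    return {
        "accuracy": correct / n,
        "cot_completeness": sum(cot_scores) / n,
    }
```

Reporting the two scores separately, rather than folding them into one number, is what lets the benchmark distinguish models that guess correct answers from models that also reason through the intermediate steps.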