🤖 AI Summary
Existing physical reasoning benchmarks rely primarily on text-only inputs or evaluate only final answers, neglecting critical intermediate steps such as variable identification and process modeling, and thus fail to comprehensively assess the physical reasoning capabilities of multimodal large language models (MLLMs).
Method: We introduce PhysicsArena, the first multimodal physics reasoning benchmark tailored for MLLMs, featuring a novel "variable–process–solution" tripartite reasoning framework quantified via structured annotations. It integrates heterogeneous modalities (images, mathematical formulas, and text), leverages multimodal prompt engineering and physics-knowledge injection, and covers 12 classical physics scenarios with over 3,000 high-quality, multi-step reasoning samples.
Contribution/Results: PhysicsArena enables fine-grained, interpretable evaluation of MLLMs' physical reasoning, significantly improving both the reliability of assessment and the ability to discriminate between models. It bridges two critical gaps: the lack of multimodal physical reasoning benchmarks and the absence of process-oriented, stepwise evaluation protocols.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
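To make the "variable–process–solution" framework concrete, the sketch below shows one plausible shape for a stage-annotated sample and a naive per-stage scorer. This is an illustration only: the paper's actual schema and metrics are not given here, and every class, field, and function name (`PhysicsSample`, `score_sample`, etc.) is invented for this sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PhysicsSample:
    """One multimodal problem with tripartite annotations (hypothetical schema)."""
    question: str           # problem statement (text, possibly with LaTeX formulas)
    image_paths: List[str]  # diagram(s) accompanying the problem
    variables: List[str]    # stage 1: physical quantities to identify
    processes: List[str]    # stage 2: governing physical processes/equations
    solution: str           # stage 3: final derivation and answer

def score_sample(pred: PhysicsSample, gold: PhysicsSample) -> dict:
    """Naive stage-wise scoring: exact-match recall on variables and processes,
    exact match on the final solution. A real benchmark would likely use fuzzier
    matching or an LLM judge; this only illustrates evaluating each stage
    separately instead of the final answer alone."""
    var_recall = (
        sum(v in pred.variables for v in gold.variables) / len(gold.variables)
        if gold.variables else 1.0
    )
    proc_recall = (
        sum(p in pred.processes for p in gold.processes) / len(gold.processes)
        if gold.processes else 1.0
    )
    return {
        "variables": var_recall,
        "processes": proc_recall,
        "solution": float(pred.solution.strip() == gold.solution.strip()),
    }
```

The point of the per-stage breakdown is that a model can reach a correct final answer while misidentifying variables or modeling the wrong process; scoring each stage separately exposes such failures that answer-only evaluation hides.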