🤖 AI Summary
Existing physics benchmarks lack systematic, up-to-date coverage of real-world physics competitions such as high-school physics Olympiads, and they do not allow direct, standardized comparison with human performance. To address this, we introduce HiPhO, the first benchmark dedicated to high-school physics Olympiads with human-aligned evaluation, covering 13 official exams from 2024-2025 across international and regional competitions and supporting both text-only and diagram-based multimodal inputs. Its three key contributions are: (1) systematic curation of the latest official competition problems; (2) fine-grained, two-level evaluation that grades both final answers and intermediate solution steps according to official marking schemes, in alignment with human examiners; and (3) standardized medal assignment (gold/silver/bronze) based on official medal cutoffs, enabling the first direct comparison between model and human contestant performance. Comprehensive testing of 30 state-of-the-art (M)LLMs reveals that open-source multimodal LLMs mostly remain at or below the bronze threshold, open-source LLMs occasionally reach gold, and top closed-source reasoning MLLMs attain 6 to 12 gold medals, yet nearly all models remain well short of full marks, highlighting a significant capability gap between current models and elite high-school physicists.
📝 Abstract
Recently, the physics capabilities of (M)LLMs have garnered increasing attention. However, existing physics benchmarks suffer from two major gaps: they neither provide systematic, up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles the 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions and covering mixed modalities, from text-only to diagram-based problems. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners, to ensure high-quality, domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that, across the 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs achieve 6 to 12 gold medals; and most models still fall well short of full marks. These results highlight the substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the significant room that remains for improvement. HiPhO, a rigorous, human-aligned, Olympiad-focused benchmark for advancing multimodal physical reasoning, is open source and available at https://github.com/SciYu/HiPhO.
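As a minimal illustration of how such human-aligned scoring might work, the Python sketch below grades a single problem at both the step level and the answer level, then maps a total exam score to a medal via score cutoffs. The `RubricItem` and `Problem` classes, the point values, and the cutoff numbers are hypothetical, introduced here only for illustration; they are not HiPhO's actual code, rubrics, or official thresholds.

```python
# Hypothetical sketch of rubric-based, two-level grading and medal assignment.
# All class names, point values, and cutoffs are illustrative assumptions,
# not HiPhO's actual data or API.
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    description: str   # e.g. "applies energy conservation"
    points: float      # marks awarded if this step is present and correct
    awarded: bool = False

@dataclass
class Problem:
    answer_points: float            # marks reserved for the final answer
    answer_correct: bool = False
    steps: list[RubricItem] = field(default_factory=list)

    def score(self) -> float:
        """Two-level grading: partial credit for steps plus final-answer marks."""
        step_marks = sum(item.points for item in self.steps if item.awarded)
        answer_marks = self.answer_points if self.answer_correct else 0.0
        return step_marks + answer_marks

def assign_medal(total: float, cutoffs: dict[str, float]) -> str:
    """Map a total exam score to a medal using official-style cutoffs."""
    for medal in ("gold", "silver", "bronze"):
        if total >= cutoffs[medal]:
            return medal
    return "no medal"

if __name__ == "__main__":
    # One problem: the model earns 2 of 3 rubric steps but misses the answer.
    p = Problem(
        answer_points=2.0,
        answer_correct=False,
        steps=[
            RubricItem("sets up free-body diagram", 1.0, awarded=True),
            RubricItem("applies Newton's second law", 1.5, awarded=True),
            RubricItem("solves for acceleration", 0.5, awarded=False),
        ],
    )
    total = p.score()  # 2.5 of a possible 5.0
    cutoffs = {"gold": 30.0, "silver": 20.0, "bronze": 12.0}  # illustrative
    print(total, assign_medal(total, cutoffs))
```

In a real grading pipeline, per-problem scores would be summed over an entire exam before applying the medal cutoffs; the key design point the sketch captures is that step-level partial credit, not just final-answer matching, determines the total.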