🤖 AI Summary
This work addresses a fundamental limitation of current vision-language models (VLMs) in joint visual-physical reasoning: weak coupling between diagram interpretation and physical reasoning, together with excessive reliance on textual cues. To tackle this, we introduce SeePhys, the first large-scale multimodal benchmark explicitly designed for physics reasoning, spanning seven physics domains and 21 heterogeneous diagram types, with 75% of items classified as "vision-essential," meaning they cannot be solved correctly without extracting information from the diagram. Our methodology features physics knowledge graph alignment, fine-grained diagram annotation, and cross-difficulty item synthesis, enabling a hierarchical evaluation framework that runs from middle school to PhD level. Extensive experiments reveal that state-of-the-art VLMs, including Gemini-2.5-Pro and o4-mini, achieve less than 60% accuracy, exposing a critical bottleneck in rigorous visual-physical co-reasoning. SeePhys thus serves as a foundational benchmark and diagnostic tool for advancing physically grounded multimodal intelligence.
📝 Abstract
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior work, where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that require visual information extraction for a correct solution. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-Pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.