SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

📅 2025-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a fundamental limitation of current vision-language models (VLMs) in joint visual-physical reasoning: weak coupling between diagram interpretation and physics reasoning, compounded by excessive reliance on textual cues. To tackle this, we introduce SeePhys, a large-scale multimodal benchmark explicitly designed for physics reasoning, spanning seven physics domains and 21 heterogeneous diagram types, with 75% of items classified as “vision-essential.” Our methodology features physics knowledge graph alignment, fine-grained diagram annotation, and cross-difficulty item synthesis, enabling a hierarchical evaluation framework that runs from middle school to PhD level. Extensive experiments reveal that state-of-the-art VLMs, including Gemini-2.5-Pro and o4-mini, achieve less than 60% accuracy, pointing to a critical bottleneck in rigorous visual-physical co-reasoning. SeePhys thus serves as a benchmark and diagnostic tool for advancing physically grounded multimodal intelligence.

📝 Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
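The abstract reports accuracy over problems that vary by education level and by whether they are vision-essential. As a minimal sketch of how such a breakdown could be computed, the Python below groups per-problem results and reports the fraction answered correctly in each group. The record fields (`level`, `vision_essential`, `correct`), the helper `accuracy_by_group`, and the toy records are all hypothetical illustrations, not the SeePhys data format or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: fraction of records marked correct} for one field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy records for illustration only; these are not SeePhys results.
records = [
    {"level": "middle school", "vision_essential": True, "correct": True},
    {"level": "middle school", "vision_essential": False, "correct": True},
    {"level": "PhD qualifying", "vision_essential": True, "correct": False},
    {"level": "PhD qualifying", "vision_essential": True, "correct": True},
]

# Accuracy split by whether the diagram is needed to solve the problem.
by_vision = accuracy_by_group(records, "vision_essential")

# Accuracy split by education level.
by_level = accuracy_by_group(records, "level")
```

Comparing the vision-essential group against the rest is one simple way to check whether a model leans on textual cues: a large gap between the two groups would be consistent with the reliance on text that the abstract describes.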

Problem

Research questions and friction points this paper is trying to address.

Benchmarking vision-based physics reasoning across education levels

Assessing visual understanding in physics problem-solving

Evaluating LLMs' diagram interpretation and reasoning coupling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal benchmark for physics reasoning

Vision-essential problems requiring visual information extraction

Evaluates advanced models' diagram interpretation and reasoning

🔎 Similar Papers

Compositional Physical Reasoning of Objects and Events from Videos