FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal benchmarks do not evaluate the global structural logic and physical constraints inherent in scientific diagrams. This work proposes FeynmanBench, the first benchmark for multimodal large language models built specifically around Feynman diagram tasks. An automated pipeline generates over 2,000 diverse problems spanning the electromagnetic, weak, and strong interactions of the Standard Model. The benchmark requires multi-step reasoning: identifying diagram topology, satisfying conservation laws and symmetry constraints, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes. By bringing Feynman diagrams from theoretical physics into AI evaluation, it tests consistency with both global topological structure and physical principles. Experiments reveal systematic failures of mainstream models in enforcing physical constraints and understanding global diagram structure, underscoring the need for physics-driven visual reasoning benchmarks.
📝 Abstract
Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI's capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.
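To make "satisfying conservation laws" concrete, the following is a minimal sketch of the kind of check a grader could apply to a single interaction vertex. It is not taken from FeynmanBench: the particle table, the function names `totals` and `conserves`, and the choice of quantum numbers (electric charge, lepton number, baryon number) are all illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical lookup table (an assumption for illustration, not the paper's data):
# name -> (electric charge in units of e, lepton number, baryon number)
PARTICLES = {
    "e-":       (Fraction(-1),    1,  Fraction(0)),
    "e+":       (Fraction(1),    -1,  Fraction(0)),
    "nu_e":     (Fraction(0),     1,  Fraction(0)),
    "nu_e_bar": (Fraction(0),    -1,  Fraction(0)),
    "photon":   (Fraction(0),     0,  Fraction(0)),
    "W-":       (Fraction(-1),    0,  Fraction(0)),
    "u":        (Fraction(2, 3),  0,  Fraction(1, 3)),
    "d":        (Fraction(-1, 3), 0,  Fraction(1, 3)),
}


def totals(names):
    """Sum each additive quantum number over a list of particle names."""
    return tuple(sum(PARTICLES[n][i] for n in names) for i in range(3))


def conserves(incoming, outgoing):
    """Return True if charge, lepton number, and baryon number all balance."""
    return totals(incoming) == totals(outgoing)


# A QED vertex and a weak charged-current vertex balance; a lone e- -> photon does not.
print(conserves(["e-", "e+"], ["photon"]))
print(conserves(["d"], ["u", "W-"]))
print(conserves(["e-"], ["photon"]))
```

Exact fractions avoid floating-point error when summing quark charges. A real evaluation would also need momentum conservation, the other lepton flavors, color, and the allowed vertex types of each interaction, none of which this sketch covers.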
Problem

Research questions and friction points this paper is trying to address.

diagrammatic reasoning
multimodal LLMs
Feynman diagrams
physics reasoning
scientific notation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feynman diagrams
multimodal LLMs
diagrammatic reasoning
physics-grounded benchmark
scattering amplitudes
Authors
Zeyu Wang
Alibaba Group
Xiaogang Li
Alibaba Group
Peiyao Xiao
Ph.D. candidate at University at Buffalo
Multi-objective optimization, Federated learning, Bilevel optimization
Qinhao Kong
Skylenage
Ben Wang
University of Oklahoma
Chengliang Xu
Alibaba Group
Zichao Chen
Alibaba Group
Bing Zhao
SRI International
Natural Language Processing, Machine Learning, Optimizations
Hu Wei
Alibaba Group