🤖 AI Summary
Formal reasoning about university-level physics remains difficult because rigorous, machine-verifiable frameworks are lacking. Method: We introduce the first systematic Lean4-based framework for physics formalization, comprising (1) LeanPhysBench, a benchmark of 200 problems spanning mechanics, electromagnetism, and other core domains, drawn from university textbooks and physics competitions; (2) PhysLib, an open-source physics library providing unit systems, foundational theorems, and an extensible axiomatization; and (3) a community-driven knowledge curation paradigm. Baseline automated theorem proving experiments cover expert Lean4 math provers and closed-source large language models. Contributions/Results: This work establishes the first university-level physics benchmark in Lean4; PhysLib improves the average proof accuracy of mainstream models by 11.75%; and our analysis identifies critical bottlenecks, including symbolic rigor, unit consistency, and multi-step causal chain modeling, indicating that structured domain knowledge is essential for robust physical reasoning.
📝 Abstract
We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4 that contains 200 hand-crafted, peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing the fundamental unit systems and theorems essential for formal physics reasoning. Using the benchmark and the Lean4 repository assembled in **Lean4PHYS**, we report baseline results for leading expert Lean4 math provers and state-of-the-art closed-source models: at their best, DeepSeek-Prover-V2-7B achieves only 16% and Claude-Sonnet-4 achieves 35%. A detailed analysis further shows that *PhysLib* improves model performance by 11.75% on average. These results demonstrate both the difficulty of *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.
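To give a sense of what a formal physics statement in Lean4 might look like, the following is a minimal sketch and not an item from *LeanPhysBench*. It assumes a project with Mathlib available; the theorem name `displacement_from_rest` and the hypothesis names are hypothetical. It does not use *PhysLib*: quantities are modeled as plain real numbers and units appear only in comments, which is exactly the kind of unit bookkeeping a dedicated unit system is meant to make explicit.

```lean
import Mathlib

-- Hypothetical textbook-style problem (illustration only):
-- a body starts from rest and accelerates uniformly at a = 2 m/s^2
-- for t = 3 s. Show that its displacement s = (1/2) * a * t^2 is 9 m.
theorem displacement_from_rest (a t s : ℝ)
    (ha : a = 2)                        -- acceleration, in m/s^2
    (ht : t = 3)                        -- elapsed time, in s
    (hs : s = (1 / 2) * a * t ^ 2) :    -- kinematics for motion from rest
    s = 9 := by
  -- Substitute the kinematic relation and the given values,
  -- then let `norm_num` finish the arithmetic.
  rw [hs, ha, ht]
  norm_num
```

A benchmark statement of this shape asks the prover to supply the tactic proof after `:= by`; the physical law enters as a hypothesis or as a library theorem.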