MMFormalizer: Multimodal Autoformalization in the Wild

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing autoformalization approaches struggle with multimodal mathematical and physical problems, where constraints such as mass or energy are implicit in the visual input. This work proposes MMFormalizer, which recursively constructs formal propositions from perceptually grounded primitives: it adaptively grounds entities across natural language and visual input, composes propositions and axioms recursively, and terminates the recursion only when each abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. The authors describe it as, to their knowledge, the first multimodal autoformalization method to handle classical mechanics, relativity, quantum mechanics, and thermodynamics. On the newly introduced PhyX-AF benchmark (115 curated samples), GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 strongest on physical reasoning, while geometry remains the most challenging domain.

📝 Abstract

Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io
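
To make the idea of recursion that terminates in dimensional grounding more concrete, the following is a minimal Python sketch. It is not the authors' implementation. The `Node` class, the `ground` function, the `BASE_DIMS` table, and the restriction to mass, length, and time are all assumptions introduced here for illustration. Each derived quantity is resolved recursively into its children until every leaf is a base dimension, and a leaf that cannot be grounded is rejected.

```python
from dataclasses import dataclass, field

# Dimension vectors are exponents of (mass, length, time).
# This three-dimension table is an illustrative assumption, not from the paper.
BASE_DIMS = {
    "mass": (1, 0, 0),
    "length": (0, 1, 0),
    "time": (0, 0, 1),
}


@dataclass
class Node:
    """Hypothetical proposition node: a base primitive or a product of children."""
    name: str
    children: list = field(default_factory=list)
    powers: list = field(default_factory=list)  # exponent applied to each child


def ground(node):
    """Recursively resolve a node to its dimension vector.

    Recursion stops at leaves that name a base dimension. A leaf that is not
    in BASE_DIMS has no grounding and raises ValueError.
    """
    if not node.children:
        if node.name not in BASE_DIMS:
            raise ValueError(f"ungrounded primitive: {node.name}")
        return BASE_DIMS[node.name]
    total = [0, 0, 0]
    for child, power in zip(node.children, node.powers):
        child_dim = ground(child)
        total = [t + power * d for t, d in zip(total, child_dim)]
    return tuple(total)


# Base primitives.
mass = Node("mass")
length = Node("length")
time = Node("time")

# Derived quantities built by composition.
velocity = Node("velocity", [length, time], [1, -1])
kinetic = Node("kinetic_energy", [mass, velocity], [1, 2])
accel = Node("acceleration", [velocity, time], [1, -1])
force = Node("force", [mass, accel], [1, 1])
work = Node("work", [force, length], [1, 1])

# Kinetic energy and work reduce to the same dimension vector,
# so a statement equating them is dimensionally consistent.
assert ground(kinetic) == ground(work)
```

In this sketch a dimensional check can only reject an ill-formed statement; it does not establish that a formal statement is semantically correct, which in the paper is assessed separately through compile and semantic accuracy.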

Problem

Research questions and friction points this paper is trying to address.

autoformalization

multimodal reasoning

visual grounding

physical constraints

formal mathematics

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal autoformalization

adaptive grounding

recursive grounding