🤖 AI Summary
Large Vision-Language Models (LVLMs) face critical limitations in Geometry Problem Solving (GPS): unreliable diagram understanding, opaque reasoning processes, and insufficient intermediate steps in formal program generation. To address these, we propose an interpretable reasoning framework that *interleaves* natural-language chain-of-thought reasoning with executable formal code generation, yielding progressive, verifiable reasoning paths. We further introduce a reinforcement learning paradigm guided by a symbolic computation solver, integrated with supervised fine-tuning, to train a Qwen2.5-VL-7B model on a novel 11K-sample synthetic GPS dataset. Experiments demonstrate state-of-the-art performance on standard GPS benchmarks, with up to a 15% absolute accuracy gain, outperforming both same-scale and significantly larger models (e.g., Qwen2.5-VL-72B). Moreover, the generated reasoning traces are more concise and formally verifiable, enhancing transparency and trustworthiness.
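To make the interleaving idea concrete, here is a minimal sketch of how a hybrid trace might be executed and verified. The `<code>...</code>` markers, the sample trace, and the use of Python `exec` as a stand-in solver are all illustrative assumptions; the paper's actual trace format and symbolic solver are not specified here.

```python
import re

# Hypothetical hybrid trace: natural-language steps interleaved with
# solver-executable snippets. The <code>...</code> delimiters are an
# assumption for illustration, not the paper's actual format.
TRACE = """Angle ABC is inscribed, so it measures half the central angle.
<code>half = 80 / 2</code>
The inscribed angle therefore equals `half` degrees.
<code>answer = half</code>"""

def run_hybrid_trace(trace: str) -> float:
    """Execute the formal segments of a hybrid trace in order,
    sharing one namespace, and return the final `answer` binding."""
    env = {}
    for snippet in re.findall(r"<code>(.*?)</code>", trace, re.S):
        # Plain `exec` stands in for the symbolic geometry solver here.
        exec(snippet, {}, env)
    return env["answer"]

print(run_hybrid_trace(TRACE))  # 40.0
```

Because every derivation step is emitted as executable code, each intermediate value can be checked mechanically, which is what makes the reasoning path verifiable rather than purely narrative.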
📝 Abstract
Large vision-language models exhibit notable limitations on Geometry Problem Solving (GPS) because of their unreliable diagram interpretation and purely natural-language reasoning. A recent line of work mitigates this by using symbolic solvers: the model directly generates a formal program that a geometry solver can execute. However, this direct program generation lacks intermediate reasoning, making the decision process opaque and prone to errors. In this work, we explore a new approach that integrates Chain-of-Thought (CoT) with formal language. The model interleaves natural language reasoning with incremental emission of solver-executable code, producing a hybrid reasoning trace in which critical derivations are expressed in formal language. To teach this behavior at scale, we combine (1) supervised fine-tuning on a newly developed 11K-sample synthetic dataset with interleaved natural language reasoning and automatic formalization, and (2) solver-in-the-loop reinforcement learning that jointly optimizes both the CoT narrative and the resulting program through outcome-based rewards. Built on Qwen2.5-VL-7B, our new model, named GF-Reasoner, achieves up to 15% accuracy improvements on standard GPS benchmarks, surpassing both 7B-scale peers and the much larger Qwen2.5-VL-72B. By exploiting high-order geometric knowledge and offloading symbolic computation to the solver, the generated reasoning traces are noticeably shorter and cleaner. Furthermore, we present a comprehensive analysis of method design choices (reasoning paradigms, data synthesis, training epochs, etc.), providing actionable insights for future research.
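The solver-in-the-loop, outcome-based reward described above can be sketched as follows. This is a minimal illustration under stated assumptions: the binary reward scheme, the `outcome_reward` name, and the use of Python `exec` in place of the real symbolic geometry solver are all hypothetical.

```python
def outcome_reward(program: str, ground_truth: float, tol: float = 1e-6) -> float:
    """Binary outcome reward: 1.0 if the emitted formal program,
    when run by the (stubbed) solver, reproduces the reference answer."""
    env = {}
    try:
        exec(program, {}, env)  # stand-in for the symbolic geometry solver
        ok = abs(env["answer"] - ground_truth) <= tol
    except Exception:
        ok = False  # non-executable or incomplete programs earn no reward
    return 1.0 if ok else 0.0

# A correct program is rewarded; a broken one is not.
print(outcome_reward("answer = (180 - 40) / 2", 70.0))  # 1.0
print(outcome_reward("answer = 1 /", 70.0))             # 0.0
```

Because the reward depends only on the solver's verdict on the final answer, it supervises the whole interleaved trace end to end without requiring step-level labels.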