🤖 AI Summary
This work addresses the tendency of multimodal large language models to conflate coincidentally correct answers with rigorously derived reasoning on geometric tasks, a failure mode rooted in outcome-oriented supervision. To remedy this, the authors replace conventional outcome-level supervision with subgoal-level evaluation, introducing GeoGoal, the first verifiable benchmark for geometric reasoning, together with SGVR, a skeleton-rate-based dense reward mechanism that guides models toward formally verifiable reasoning paths. Notably, this is the first study to incorporate numerical subgoals produced by formal verification into model training, and it reveals a critical misalignment between reasoning quality and answer accuracy. Experiments show that the proposed approach improves geometric reasoning performance by 9.7% and generalizes well, yielding gains of 8.0% on general mathematical tasks and 2.8% on other reasoning benchmarks.
📝 Abstract
Multimodal Large Language Models (MLLMs) struggle with complex geometric reasoning, largely because "black-box" outcome-based supervision fails to distinguish lucky guesses from rigorous deduction. To address this, we introduce a paradigm shift toward subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal-verification data engine that converts abstract proofs into verifiable numeric subgoals. This structure reveals a critical divergence between reasoning quality and outcome accuracy. Building on it, we propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse outcome signals with dense rewards based on the Skeleton Rate. Experiments demonstrate that SGVR not only enhances geometric performance (+9.7%) but also generalizes strongly, transferring gains to general math (+8.0%) and other general reasoning tasks (+2.8%).
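The abstract does not give the exact form of the Skeleton-Rate reward, but the idea of replacing a sparse outcome signal with a dense subgoal-level one can be sketched as follows. This is an illustrative reading, not the paper's implementation: `skeleton_rate`, `dense_reward`, and the blending weight `alpha` are all hypothetical names and choices.

```python
from typing import List

def skeleton_rate(subgoal_verified: List[bool]) -> float:
    """Fraction of numeric subgoals in a reasoning trace that pass
    formal verification (a hypothetical reading of the Skeleton Rate)."""
    if not subgoal_verified:
        return 0.0
    return sum(subgoal_verified) / len(subgoal_verified)

def dense_reward(subgoal_verified: List[bool],
                 answer_correct: bool,
                 alpha: float = 0.5) -> float:
    """Blend the subgoal-level signal with the final-answer signal.

    A pure outcome reward would be float(answer_correct); adding the
    skeleton rate makes the reward dense, so a trace that derives most
    subgoals correctly is preferred over a lucky guess. The weighting
    scheme here is an assumption for illustration only.
    """
    return alpha * skeleton_rate(subgoal_verified) + (1 - alpha) * float(answer_correct)

# A lucky guess (correct answer, no verified subgoals) scores lower
# than a mostly rigorous trace with the same correct answer.
lucky = dense_reward([False, False, False], answer_correct=True)
rigorous = dense_reward([True, True, False], answer_correct=True)
```

Under this sketch, `lucky` evaluates to 0.5 while `rigorous` evaluates to about 0.83, illustrating how subgoal-level evaluation separates reasoning quality from outcome accuracy.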