🤖 AI Summary
This work addresses the difficulty that existing small-to-medium-scale multimodal language models have in accurately interpreting images and in integrating visual perception with symbolic reasoning on visually dense mathematical tasks such as geometry. To this end, we propose SpatialMath, a framework that uses a dedicated spatial-aware perception module to extract structured spatial representations from diagrams and infuses them into the symbolic reasoning chain, enabling vision-driven, structured mathematical reasoning. We also introduce MATHVERSE-PLUS, a new dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive problems. Experimental results show improvements of up to 10 percentage points over supervised fine-tuning with data augmentation on vision-intensive mathematical reasoning tasks, supporting the effectiveness of the proposed joint perception-reasoning architecture.
📝 Abstract
Multimodal Small-to-Medium-sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information, but they still face significant limitations in visual comprehension and mathematical reasoning, particularly on geometric problems with diverse levels of visual infusion. Current models struggle to accurately decompose intricate visual inputs and to connect perception with structured reasoning, leading to suboptimal performance. To address these challenges, we propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains. SpatialMath employs a specialized perception module to extract spatially grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships. These representations are then methodically infused into symbolic reasoning chains, facilitating structured reasoning that is aware of visual comprehension. To support this, we introduce MATHVERSE-PLUS, a novel dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive mathematical problems. SpatialMath significantly outperforms strong multimodal baselines, achieving an improvement of up to 10 percentage points over supervised fine-tuning with data augmentation in vision-intensive settings. Robustness analysis reveals that enhanced spatial representations directly improve reasoning accuracy, reinforcing the need for structured perception-to-reasoning pipelines in MSLMs.
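As a rough illustration of the perception-to-reasoning idea described above, the sketch below shows one way a structured spatial representation could be placed ahead of a step-by-step reasoning request. It is not the SpatialMath implementation: the `SpatialRepresentation` class, the `build_reasoning_prompt` function, the field names, and the prompt layout are all hypothetical and chosen only for illustration. The perception step itself is assumed to have already produced the listed entities and relations.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialRepresentation:
    """Hypothetical structured output of a perception module for one diagram."""
    entities: list[str] = field(default_factory=list)   # e.g. "triangle ABC"
    relations: list[str] = field(default_factory=list)  # e.g. "AB = 3"


def build_reasoning_prompt(question: str, spatial: SpatialRepresentation) -> str:
    """Place the extracted spatial facts ahead of the step-by-step reasoning request."""
    lines = ["Diagram entities:"]
    lines += [f"- {e}" for e in spatial.entities]
    lines.append("Spatial relations:")
    lines += [f"- {r}" for r in spatial.relations]
    lines.append(f"Question: {question}")
    lines.append("Reason step by step, citing the relations above.")
    return "\n".join(lines)


# Assumed example: facts a perception module might extract from a right-triangle diagram.
spatial = SpatialRepresentation(
    entities=["triangle ABC"],
    relations=["angle ABC = 90 degrees", "AB = 3", "BC = 4"],
)
prompt = build_reasoning_prompt("What is the length of AC?", spatial)
print(prompt)
```

Keeping the spatial facts as an explicit, structured block (rather than leaving them implicit in the image) reflects the abstract's claim that better spatial representations feed directly into reasoning accuracy; how the actual framework encodes and infuses them is not specified here.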