🤖 AI Summary
This work identifies a critical limitation of vision-language models (VLMs) in visual mathematical equation solving: VLMs exhibit substantially degraded performance when equations are presented as images, variables are represented by object icons, and coefficients must be inferred via object counting—contrasting sharply with their strong performance on textual equations. To diagnose this failure systematically, the authors introduce the first visual equation solving benchmark, decomposing the multi-step cross-modal reasoning task into three sequential subtasks: icon recognition, numerical counting, and symbolic solving. Through ablation, they pinpoint counting accuracy as the primary bottleneck and show that recognition errors propagate and amplify along the reasoning chain. Experiments reveal not only low counting fidelity but also progressive deterioration of symbolic reasoning as equation complexity increases. The study provides the first quantitative characterization of perceptual–symbolic decoupling in visual mathematical reasoning, offering an empirical foundation for interpretable VLM evaluation and targeted architectural improvements.
📝 Abstract
Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.
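To make the task decomposition concrete, the sketch below mocks the pipeline the abstract describes: icon recognition yields lists of detected objects per equation, counting turns icon multiplicities into coefficients, and symbolic solving handles the resulting linear system. The icon names, counts, and right-hand sides are illustrative assumptions, not examples from the paper's benchmark; recognition and counting are hard-coded here to isolate the symbolic step.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical recognized icons per image (the output a VLM's recognition and
# counting steps would produce). Values are illustrative, not from the paper.
eq1_icons, eq1_rhs = ["apple", "apple", "apple", "banana"], 11  # 3a + 1b = 11
eq2_icons, eq2_rhs = ["apple", "banana", "banana"], 7           # 1a + 2b = 7

def solve_visual_system(icons1, rhs1, icons2, rhs2):
    """Counting step: coefficients are icon multiplicities.
    Solving step: Cramer's rule on the resulting 2x2 linear system."""
    c1, c2 = Counter(icons1), Counter(icons2)
    x, y = sorted(set(c1) | set(c2))  # fix a deterministic variable order
    a1, b1 = c1[x], c1[y]
    a2, b2 = c2[x], c2[y]
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("system is singular")
    return {x: Fraction(rhs1 * b2 - rhs2 * b1, det),
            y: Fraction(a1 * rhs2 - a2 * rhs1, det)}

print(solve_visual_system(eq1_icons, eq1_rhs, eq2_icons, eq2_rhs))
# → {'apple': Fraction(3, 1), 'banana': Fraction(2, 1)}
```

The paper's finding is that the brittle stages are precisely the ones mocked out above: even when the symbolic solve is trivial, errors in recognizing and counting the icons corrupt the coefficients before solving begins.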