Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding

📅 2025-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) perform poorly on visual arithmetic tasks such as counting and length comparison, which limits their applicability to downstream tasks like chart understanding and geometric reasoning. Through probing tasks, we identify that the visual encoder retains sufficient information; the bottleneck is the text decoder's failure to decode it for arithmetic reasoning. To address this, we propose CogAlign, a post-training method grounded in Piaget's theory of cognitive development, which enforces representational invariance under visual transformations via multi-stage contrastive alignment and cognition-driven representational constraints, requiring no labeled data. CogAlign substantially improves three diverse VLMs on our newly constructed visual-arithmetic probing tasks, and raises performance on CHOCOLATE and MATH-VISION by an average of 4.6% and 2.9%, matching or outperforming supervised fine-tuning while using only 40% of its training data. This work is the first to systematically integrate developmental psychology principles into VLM post-training, enabling robust enhancement of fundamental arithmetic capabilities and strong transfer to complex reasoning tasks.

📝 Abstract
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic: seemingly simple capabilities such as object counting or length comparison that are essential for complex downstream tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhance Visual Arithmetic in Vision Language Models
Improve Chart and Geometry Understanding
Address Decoder Failure in Arithmetic Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

CogAlign post-training strategy
Recognizes invariant visual properties
Improves VLM arithmetic capabilities
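This summary does not spell out CogAlign's actual training objective, but the summary's phrase "contrastive alignment" enforcing "representational invariance under visual transformations" belongs to a well-known family of losses. As a hedged illustration only, not the paper's implementation, an InfoNCE-style invariance objective over paired embeddings (an image and a transformed view of it) could look like this; the function name, array shapes, and temperature are illustrative assumptions:

```python
import numpy as np

def invariance_loss(z_orig, z_aug, temperature=0.1):
    """Illustrative InfoNCE-style loss (not CogAlign's exact objective).

    Encourages each row of z_orig (embedding of an image) to be closest
    to the matching row of z_aug (embedding of a transformed view),
    treating all other rows in the batch as negatives.
    """
    # L2-normalize each embedding so the dot product is cosine similarity.
    z_orig = z_orig / np.linalg.norm(z_orig, axis=1, keepdims=True)
    z_aug = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
    # Pairwise similarity matrix, sharpened by the temperature.
    logits = z_orig @ z_aug.T / temperature
    # Matching pairs sit on the diagonal; they are the positives.
    idx = np.arange(len(z_orig))
    # Row-wise cross-entropy against the diagonal targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, idx].mean()
```

Under this kind of objective, embeddings of a chart and a resized or shifted copy of it would be pulled together, which is one plausible way to operationalize Piaget-style invariance; the paper's method may differ in details such as loss form, staging, and negatives.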