Forgotten Polygons: Multimodal Large Language Models are Shape-Blind

📅 2025-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a fundamental limitation of multimodal large language models (MLLMs) in geometric visual reasoning: top models achieve under 50% accuracy at identifying regular polygons and fail to count the sides of both familiar and novel shapes. Analyzed through dual-process cognitive theory, these failures indicate that MLLMs default to intuitive, memorized associations (System 1) rather than deliberate, stepwise reasoning (System 2) for shape identification. To address this, the authors propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which guides multi-step reasoning by explicitly referencing visual annotations in diagrams. Experiments demonstrate that VC-CoT boosts GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%, substantially activating latent visual reasoning capabilities. The implementation is publicly available.

📝 Abstract
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
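The abstract describes VC-CoT as prompting that explicitly references visual annotations in a diagram to force step-by-step counting. A minimal sketch of what such a prompt builder might look like is below; the function name, wording, and the assumption that diagrams carry vertex labels are illustrative, not the paper's exact implementation (see the linked repository for that).

```python
# Hedged sketch of a VC-CoT-style prompt for polygon side counting.
# Assumption: the diagram has been annotated with one label per vertex
# (e.g. "A", "B", ...), and the prompt directs the model to enumerate
# those labels before counting, rather than guessing the shape name.

def build_vc_cot_prompt(vertex_labels):
    """Build a side-counting prompt that explicitly references the
    visual annotations (vertex labels) drawn on the diagram."""
    labels = ", ".join(vertex_labels)
    return (
        "The polygon in the image has its vertices annotated with the "
        f"labels {labels}. First, list every labeled vertex you can see "
        "in the image. Then count the labels to get the number of "
        "vertices, and report the number of sides, which equals the "
        "number of vertices."
    )

prompt = build_vc_cot_prompt(["A", "B", "C", "D", "E"])
print(prompt)
```

The key design point the paper reports is that anchoring each reasoning step to a concrete visual cue (the labels) is what moves the model from System 1 pattern matching to System 2 counting.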
Problem

Research questions and friction points this paper is trying to address.

MLLMs fall short of human performance on visual-math benchmarks, with top models under 50% accuracy on regular polygon identification
MLLMs rely on System 1 (memorized associations) rather than System 2 (deliberate reasoning) for shape recognition
MLLMs fail to count the sides of both familiar and novel shapes, suggesting they neither grasp the concept of sides nor effectively process visual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visually Cued Chain-of-Thought (VC-CoT) prompting that explicitly references visual annotations in diagrams
Dual-process (System 1 vs. System 2) analysis of MLLM shape-recognition failures
GPT-4o accuracy on irregular polygon side-counting raised from 7% to 93%
🔎 Similar Papers