🤖 AI Summary
It remains unclear whether vision-only models can spontaneously acquire geometric and topological (GT) concepts without linguistic supervision, as human children do over the course of cognitive development.
Method: We introduce the first standardized psychophysical benchmark covering 43 GT concepts, employing odd-one-out tasks. Human children’s behavioral responses serve as the cognitive alignment ground truth. We systematically evaluate zero-shot transfer performance of CNNs, Vision Transformers (ViTs), and vision-language models (e.g., CLIP).
Contribution/Results: Purely visual ViTs outperform children in GT recognition accuracy and exhibit strong rank-order correspondence with children’s difficulty profiles (Pearson *r* = 0.82). In contrast, multimodal models like CLIP underperform significantly (*p* < 0.001), challenging the prevailing assumption that multimodality inherently enhances geometric reasoning. This work provides the first empirical evidence that vision-only systems can autonomously develop human-like GT abstraction capabilities without language supervision.
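The zero-shot odd-one-out protocol described above can be sketched as follows. This is not the paper's released code, and the function names are hypothetical; it assumes the standard embedding-based approach: each trial's images are encoded by a frozen vision model, the item least similar to the rest is chosen, and cognitive alignment is the Pearson correlation between the model's and children's per-concept accuracy profiles.

```python
import numpy as np

def odd_one_out(embeddings: np.ndarray) -> int:
    """Pick the item least similar (on average) to the others.

    embeddings: (n_items, dim) image features from a frozen
    vision model (zero-shot: no task-specific training).
    """
    # L2-normalise so dot products equal cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, 0.0)
    # The odd one out has the lowest mean similarity to the rest.
    mean_sim = sim.sum(axis=1) / (len(z) - 1)
    return int(np.argmin(mean_sim))

def difficulty_alignment(model_acc, child_acc) -> float:
    """Pearson r between per-concept accuracy profiles."""
    return float(np.corrcoef(model_acc, child_acc)[0, 1])

# Toy trial: five near-identical vectors plus one outlier at index 3.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
items = np.stack([base + 0.05 * rng.normal(size=16) for _ in range(6)])
items[3] = rng.normal(size=16)  # the odd item
print(odd_one_out(items))  # -> 3
```

In practice the embeddings would come from the penultimate layer of a CNN or ViT (or CLIP's image encoder), and `difficulty_alignment` would be applied to accuracies aggregated over the 43 GT concepts.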
📝 Abstract
With the rapid improvement of machine learning (ML) models, cognitive scientists are increasingly asking about their alignment with how humans think. Here, we ask this question for computer vision models and human sensitivity to geometric and topological (GT) concepts. Under the core knowledge account, these concepts are innate and supported by dedicated neural circuitry. In this work, we investigate an alternative explanation: that GT concepts are learned "for free" through everyday interaction with the environment. We do so using computer vision models, which are trained on large image datasets. We build on prior studies to investigate the overall performance and human alignment of three classes of models -- convolutional neural networks (CNNs), transformer-based models, and vision-language models -- on an odd-one-out task testing 43 GT concepts spanning seven classes. Transformer-based models achieve the highest overall accuracy, surpassing that of young children. They also show strong alignment with children's performance, finding the same classes of concepts easy vs. difficult. By contrast, vision-language models underperform their vision-only counterparts and deviate further from human difficulty profiles, indicating that naïve multimodality might compromise abstract geometric sensitivity. These findings support the use of computer vision models to evaluate the sufficiency of the learning account for explaining human sensitivity to GT concepts, while also suggesting that integrating linguistic and visual representations might have unpredicted deleterious consequences.