🤖 AI Summary
It remains unclear whether vision-only models can spontaneously acquire geometric and topological (GT) concepts without linguistic supervision, as human children do over the course of cognitive development.
Method: We introduce the first standardized psychophysical benchmark covering 43 GT concepts, employing odd-one-out tasks. Human children’s behavioral responses serve as the cognitive alignment ground truth. We systematically evaluate zero-shot transfer performance of CNNs, Vision Transformers (ViTs), and vision-language models (e.g., CLIP).
Contribution/Results: Purely visual ViTs outperform children in GT recognition accuracy and exhibit strong rank-order correspondence with children’s difficulty profiles (Pearson *r* = 0.82). In contrast, multimodal models like CLIP underperform significantly (*p* < 0.001), challenging the prevailing assumption that multimodality inherently enhances geometric reasoning. This work provides the first empirical evidence that vision-only systems can autonomously develop human-like GT abstraction capabilities without language supervision.
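The zero-shot odd-one-out protocol described above can be sketched as follows. This is not the paper's released code, and the function names are hypothetical; it assumes the standard embedding-based approach: each trial's images are encoded by a frozen vision model, the item least similar to the rest is chosen, and cognitive alignment is the Pearson correlation between the model's and children's per-concept accuracy profiles.

```python
import numpy as np

def odd_one_out(embeddings: np.ndarray) -> int:
    """Pick the item least similar (on average) to the others.

    embeddings: (n_items, dim) image features from a frozen
    vision model (zero-shot: no task-specific training).
    """
    # L2-normalise so dot products equal cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, 0.0)
    # The odd one out has the lowest mean similarity to the rest.
    mean_sim = sim.sum(axis=1) / (len(z) - 1)
    return int(np.argmin(mean_sim))

def difficulty_alignment(model_acc, child_acc) -> float:
    """Pearson r between per-concept accuracy profiles."""
    return float(np.corrcoef(model_acc, child_acc)[0, 1])

# Toy trial: five near-identical vectors plus one outlier at index 3.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
items = np.stack([base + 0.05 * rng.normal(size=16) for _ in range(6)])
items[3] = rng.normal(size=16)  # the odd item
print(odd_one_out(items))  # -> 3
```

In practice the embeddings would come from the penultimate layer of a CNN or ViT (or CLIP's image encoder), and `difficulty_alignment` would be applied to accuracies aggregated over the 43 GT concepts.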
📝 Abstract
With the rapid improvement of machine learning (ML) models, cognitive scientists are increasingly asking about their alignment with how humans think. Here, we ask this question for computer vision models and human sensitivity to geometric and topological (GT) concepts. Under the core knowledge account, these concepts are innate and supported by dedicated neural circuitry. In this work, we investigate an alternative explanation: that GT concepts are learned "for free" through everyday interaction with the environment. We do so using computer vision models, which are trained on large image datasets. We build on prior studies to investigate the overall performance and human alignment of three classes of models -- convolutional neural networks (CNNs), transformer-based models, and vision-language models -- on an odd-one-out task testing 43 GT concepts spanning seven classes. Transformer-based models achieve the highest overall accuracy, surpassing that of young children. They also show strong alignment with children's performance, finding the same classes of concepts easy vs. difficult. By contrast, vision-language models underperform their vision-only counterparts and deviate further from human difficulty profiles, indicating that naïve multimodality might compromise abstract geometric sensitivity. These findings support the use of computer vision models to evaluate the sufficiency of the learning account for explaining human sensitivity to GT concepts, while also suggesting that integrating linguistic and visual representations might have unpredicted deleterious consequences.