Visual Enumeration is Challenging for Large-scale Generative AI

📅 2024-01-09
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study systematically evaluates the visual number sense—the capacity to rapidly estimate small quantities (1–10 objects)—in multimodal large language models (MLLMs), revealing performance far below human and non-human animal baselines. Method: the authors propose a zero-shot, prompt-driven dual-task evaluation framework (recognition + generation), applicable to both leading open- and closed-source MLLMs. Contributions/Results: (1) most open-source MLLMs exhibit >40% error rates even on small numerosities (3–5); (2) error patterns are strongly category-dependent and violate Weber's law, challenging the "scale-as-ability" hypothesis; (3) only the latest closed-source systems show nascent signatures of a human-like number sense. The work provides empirical evidence that visual numerosity representation is a critical bottleneck in AI perceptual grounding, establishing a benchmark and theoretical foundation for multimodal cognitive modeling and model improvement.

📝 Abstract
Humans can readily judge the number of objects in a visual scene, even without counting, and such a skill has been documented in many animal species and babies prior to language development and formal schooling. Numerical judgments are error-free for small sets, while for larger collections responses become approximate, with variability increasing proportionally to the target number. This response pattern is observed for items of all kinds, despite variation in object features (such as color or shape), suggesting that our visual number sense relies on abstract representations of numerosity. Here, we investigate whether large-scale generative Artificial Intelligence (AI) systems have a human-like number sense, which should allow them to reliably name the number of objects in simple visual stimuli or generate images containing a target number of items in the 1-10 range. Surprisingly, most of the foundation models considered have a poor number sense: They make striking errors even with small numbers, the response variability does not increase in a systematic way, and the pattern of errors depends on object category. Only the most recent proprietary systems exhibit signatures of a visual number sense. Our findings demonstrate that having an intuitive visual understanding of number remains challenging for foundation models, which in turn might be detrimental to the perceptual grounding of numeracy that in humans is crucial for mathematical learning.
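The approximate-number signature described in the abstract can be quantified with the coefficient of variation (CV = standard deviation of responses / mean response): Weber's law predicts a roughly constant CV as set size grows. A minimal sketch of this check, using hypothetical response data for illustration only:

```python
# Sketch: testing Weber-law compliance on enumeration responses.
# The response data below are hypothetical, for illustration only.
import statistics

# responses[n] = numerical answers given for stimuli with n objects
responses = {
    2: [2, 2, 2, 2, 2],      # small sets: essentially error-free
    6: [5, 6, 6, 7, 6],      # larger sets: approximate
    10: [8, 10, 12, 9, 11],  # variability grows with the target
}

for n, answers in sorted(responses.items()):
    mean = statistics.mean(answers)
    sd = statistics.pstdev(answers)
    cv = sd / mean if mean else float("nan")
    print(f"n={n:2d}  mean={mean:4.1f}  sd={sd:.2f}  CV={cv:.2f}")
```

Under Weber's law the CV stays roughly flat across n; the paper reports that most open models show no such systematic scaling of variability.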
Problem

Research questions and friction points this paper is trying to address.

Evaluating the visual enumeration skills of AI systems via benchmark tasks
Assessing number sense in multimodal foundation models
Testing accuracy both in counting objects and in generating images with a target numerosity
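The recognition side of such a benchmark can be sketched as follows. This is not the paper's code: `query_model` is a hypothetical stand-in for any MLLM API call that returns a free-form textual answer, and the prompt wording is an assumption.

```python
# Sketch of the recognition-side benchmark task (hypothetical API).
# `query_model(image, prompt)` stands in for any MLLM call that
# returns a free-form textual answer; it is not from the paper.
import re
from collections import defaultdict

def parse_count(answer: str):
    """Extract the first integer from a free-form model answer."""
    m = re.search(r"\d+", answer)
    return int(m.group()) if m else None

def error_rate_by_numerosity(trials, query_model):
    """trials: iterable of (image, true_count) pairs, true_count in 1-10."""
    outcomes = defaultdict(list)
    prompt = "How many objects are in this image? Answer with a number."
    for image, true_count in trials:
        predicted = parse_count(query_model(image, prompt))
        outcomes[true_count].append(predicted != true_count)
    # Per-numerosity error rate, e.g. {3: 0.0, 5: 1.0}
    return {n: sum(flags) / len(flags) for n, flags in sorted(outcomes.items())}
```

The generation side mirrors this: prompt an image generator for exactly n objects, count the items in the output, and score the same per-numerosity error rate.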
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark tasks for visual enumeration evaluation
Testing multimodal models on number sense
Open-source code for future AI assessment