Unveiling the Visual Counting Bottleneck in Vision-Language Models

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe extrapolation failures of vision-language models (VLMs) on systematic generalization tasks such as visual counting. The authors propose a “fractured magnitude” hypothesis and decompose visual counting into three stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go board datasets, linear probing analyses, and comparative reasoning tasks, they diagnose the internal representations of a state-of-the-art foundation model. They find that visual backbones robustly encode quantity, so perception is intact, but the symbolic mapping stage fails: valid visual magnitudes are not projected onto number tokens, which prevents cross-modal alignment in numerical space. The results suggest that scaling data alone is insufficient to overcome this bottleneck and point to a fundamental limitation in current models’ ability to achieve abstract symbolic alignment.

📝 Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on a state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.
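The linear-probe protocol mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' code or data. It assumes a hypothetical frozen "backbone" (per-row stone counts mixed by a fixed random rotation) in place of a real vision encoder, a hypothetical split of 1 to 10 stones for training and 11 to 20 for testing, and a least-squares regression probe rather than whatever probe the paper actually fits. All function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 19  # side length of a standard Go board


def make_board(n_stones):
    """Place n_stones on distinct random intersections of an empty board."""
    board = np.zeros(BOARD * BOARD)
    board[rng.choice(BOARD * BOARD, size=n_stones, replace=False)] = 1.0
    return board.reshape(BOARD, BOARD)


# Hypothetical frozen "backbone" (an assumption, not a real VLM encoder):
# per-row stone counts mixed by a fixed random orthogonal matrix, so the
# total count is encoded linearly but not in any single feature.
Q, _ = np.linalg.qr(rng.normal(size=(BOARD, BOARD)))


def backbone(board):
    return Q @ board.sum(axis=1)


def dataset(counts, per_count):
    """Build (features, labels) with per_count random boards per stone count."""
    ys = np.repeat(counts, per_count)
    xs = np.stack([backbone(make_board(int(n))) for n in ys])
    return xs, ys.astype(float)


def with_bias(xs):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([xs, np.ones((len(xs), 1))])


# Train the probe on small counts only, then evaluate on unseen larger counts.
train_x, train_y = dataset(np.arange(1, 11), per_count=60)
test_x, test_y = dataset(np.arange(11, 21), per_count=20)

weights, *_ = np.linalg.lstsq(with_bias(train_x), train_y, rcond=None)
pred = with_bias(test_x) @ weights
extrapolation_mae = float(np.mean(np.abs(pred - test_y)))
print(f"extrapolation MAE: {extrapolation_mae:.2e}")
```

In this toy setting the probe extrapolates because the count is an exact linear function of the features. With a real backbone, low probe error on held-out counts would be the kind of evidence the abstract describes for quantity being encoded in the visual representation, independently of whether the language side can name the number.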

Problem

Research questions and friction points this paper is trying to address.

visual counting

systematic generalization

vision-language models

extrapolation bottleneck

symbolic mapping

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual counting

systematic generalization

symbolic mapping