Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting

📅 2025-10-05
🤖 AI Summary
This work identifies a deficiency in current vision-language models (VLMs) in compositional object counting: VLMs count accurately when an image contains a single shape type, but their performance degrades significantly when multiple geometric shapes appear together. To evaluate this limitation, the authors introduce VLMCountBench, a benchmark for compositional counting built on controlled, minimalist geometric images. Through systematic ablation studies that vary color, size, and prompting strategy, the study shows empirically that VLMs fail at basic compositional counting and that prompt refinement does not remove the failure. Beyond providing a reproducible evaluation framework, the work points to a bottleneck in how VLMs align visual and semantic information for structured scenes containing several object types. These findings motivate further research on model architectures and training paradigms.

📝 Abstract
Vision-Language Models (VLMs) have become a central focus of today's AI community, owing to their impressive abilities gained from training on large-scale vision-language data from the Web. These models have demonstrated strong performance across diverse tasks, including image understanding, video understanding, complex visual reasoning, and embodied AI. Despite these noteworthy successes, a fundamental question remains: Can VLMs count objects correctly? In this paper, we introduce a simple yet effective benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes (e.g., triangles, circles) and their compositions, focusing exclusively on counting tasks without interference from other factors. We adopt strict independent variable control and systematically study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation. Our empirical results reveal that while VLMs can count reliably when only one shape type is present, they exhibit substantial failures when multiple shape types are combined (i.e., compositional counting). This highlights a fundamental empirical limitation of current VLMs and motivates important directions for future research.
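
The minimalist setting described in the abstract can be illustrated with a short sketch. The code below is not the authors' generator; it is a hypothetical example, using only the Python standard library, of how an image with a known number of non-overlapping shapes might be produced as an SVG string. The function names (`make_scene`, `to_svg`), the canvas size, the shape size, and the gap between shapes are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def make_scene(counts, canvas=512, size=40, color="black", gap=4, seed=0, max_tries=10_000):
    """Place shapes at random, rejecting positions that would overlap.

    Hypothetical helper: each shape is bounded by a disc of diameter `size`,
    so two shapes cannot overlap if their centers are at least `size + gap` apart.
    """
    rng = random.Random(seed)
    r = size / 2
    placed = []
    for shape, n in counts.items():
        for _ in range(n):
            for _ in range(max_tries):
                x = rng.uniform(r, canvas - r)
                y = rng.uniform(r, canvas - r)
                if all(math.hypot(x - s["x"], y - s["y"]) >= size + gap for s in placed):
                    placed.append({"shape": shape, "x": x, "y": y, "r": r, "color": color})
                    break
            else:
                raise RuntimeError("could not place all shapes without overlap")
    return placed


def to_svg(scene, canvas=512):
    """Render the scene as an SVG string (one element per shape)."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{canvas}" height="{canvas}">']
    for s in scene:
        x, y, r, fill = s["x"], s["y"], s["r"], s["color"]
        if s["shape"] == "circle":
            parts.append(f'<circle cx="{x:.1f}" cy="{y:.1f}" r="{r:.1f}" fill="{fill}"/>')
        elif s["shape"] == "square":
            h = r / math.sqrt(2)  # half side of a square inscribed in the bounding disc
            parts.append(
                f'<rect x="{x - h:.1f}" y="{y - h:.1f}" '
                f'width="{2 * h:.1f}" height="{2 * h:.1f}" fill="{fill}"/>'
            )
        elif s["shape"] == "triangle":
            dx, dy = r * math.sqrt(3) / 2, r / 2  # equilateral triangle inscribed in the disc
            pts = f"{x:.1f},{y - r:.1f} {x - dx:.1f},{y + dy:.1f} {x + dx:.1f},{y + dy:.1f}"
            parts.append(f'<polygon points="{pts}" fill="{fill}"/>')
        else:
            raise ValueError(f"unknown shape: {s['shape']}")
    parts.append("</svg>")
    return "\n".join(parts)


# A compositional example: two shape types in one image, with known counts.
scene = make_scene({"circle": 3, "triangle": 5}, seed=1)
ground_truth = Counter(s["shape"] for s in scene)
svg = to_svg(scene)
```

Because the ground-truth counts come from the same scene description that produces the image, a model's answer can be checked exactly, with no annotation step. Passing a single-entry `counts` dictionary gives the single-shape case for comparison.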

Problem

Research questions and friction points this paper is trying to address.

Exposing VLMs' failure in counting multiple object types
Benchmarking compositional counting with geometric shapes
Identifying limitations in VLM object counting capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed VLMCountBench, a minimalist benchmark focused exclusively on counting
Ran controlled ablations over color, size, and prompt refinement on basic geometric shapes
Exposed VLM failures in compositional counting tasks
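
The controlled-ablation idea, changing one variable at a time while holding the rest fixed, can be sketched as follows. This is an illustrative example rather than the paper's protocol: the factor levels, the baseline values, and the function names (`one_factor_at_a_time`, `exact_match_accuracy`) are assumptions, and the scoring rule shown (an answer counts as correct only if every shape count matches) is one plausible choice, not one confirmed by the source.

```python
# Hypothetical baseline condition and factor levels; the paper's actual values may differ.
BASELINE = {"color": "black", "size": 40, "prompt": "plain"}
FACTORS = {
    "color": ["black", "red", "blue"],
    "size": [20, 40, 80],
    "prompt": ["plain", "step_by_step"],
}


def one_factor_at_a_time(baseline, factors):
    """Return the baseline plus every condition that differs from it in exactly one factor."""
    conditions = [dict(baseline)]
    for name, levels in factors.items():
        for level in levels:
            if level != baseline[name]:
                conditions.append({**baseline, name: level})
    return conditions


def exact_match_accuracy(predicted, truth):
    """Fraction of images for which all per-shape counts are correct."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("need at least one example")
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)


conditions = one_factor_at_a_time(BASELINE, FACTORS)

# Toy usage: the second prediction miscounts triangles, so only half the answers match.
truth = [{"circle": 3, "triangle": 5}, {"circle": 3, "triangle": 5}]
predicted = [{"circle": 3, "triangle": 5}, {"circle": 3, "triangle": 4}]
accuracy = exact_match_accuracy(predicted, truth)
```

Varying a single factor per condition means any change in accuracy relative to the baseline can be attributed to that factor alone, which is the point of strict independent-variable control.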