Counting to Four is still a Chore for VLMs

📅 2026-04-11

🤖 AI Summary
This work addresses the poor performance of vision-language models (VLMs) on basic counting tasks and the lack of diagnostic tools for pinpointing where failures arise. The authors introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases, and combine attention analysis, component-wise probing, and adversarial prompting to investigate VLM counting failures. Their analysis shows that count-relevant visual evidence is strongest at the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, they evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. The results suggest that counting failures stem not only from limits of visual perception but also from the underuse of visual evidence during language-stage reasoning.

📝 Abstract
Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
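
The abstract does not say how MAS is computed or applied, so the following is only a rough sketch of what "a minimum budget of visual attention" could mean for a single attention row. It assumes the budget is a floor on the fraction of attention mass assigned to visual tokens, and that mass is moved from text tokens to visual tokens by proportional rescaling when the floor is not met. The function names, the `is_visual` mask, and the `min_share` parameter are hypothetical and are not taken from the paper or its code release.

```python
def visual_attention_share(weights, is_visual):
    """Return the fraction of attention mass that falls on visual tokens."""
    total = sum(weights)
    visual = sum(w for w, v in zip(weights, is_visual) if v)
    return visual / total if total > 0 else 0.0


def enforce_min_visual_share(weights, is_visual, min_share):
    """Rescale one attention row so visual tokens hold at least `min_share`.

    Assumption (not from the paper): visual and text weights are each scaled
    by a constant factor, so relative weights within each modality are kept
    and the row total is unchanged.
    """
    if not 0.0 <= min_share <= 1.0:
        raise ValueError("min_share must lie in [0, 1]")

    total = sum(weights)
    visual = sum(w for w, v in zip(weights, is_visual) if v)
    text = total - visual

    # Nothing to rescale if one modality carries no mass at all.
    if total <= 0 or visual <= 0 or text <= 0:
        return list(weights)

    # The floor is already satisfied: leave the row untouched.
    if visual / total >= min_share:
        return list(weights)

    visual_scale = min_share * total / visual
    text_scale = (1.0 - min_share) * total / text
    return [
        w * (visual_scale if v else text_scale)
        for w, v in zip(weights, is_visual)
    ]


# Toy row: two visual tokens followed by two text tokens.
row = [0.05, 0.05, 0.6, 0.3]
mask = [True, True, False, False]
adjusted = enforce_min_visual_share(row, mask, min_share=0.3)
```

In this toy row the visual tokens start below the requested floor, so the sketch raises their combined share to the floor and shrinks the text weights to compensate. A real intervention would act on a model's attention tensors across heads and layers during decoding, and the released code may differ from this sketch in every respect.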
Problem

Research questions and friction points this paper is trying to address.

vision-language models
object counting
multimodal reasoning
visual grounding
model failure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Object Counting
Attention Analysis
Modality Attention Share
Mechanistic Interpretability