GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses the susceptibility of vision-language models (VLMs) to hallucination in counting tasks, where their performance lags significantly behind their other visual reasoning capabilities. To mitigate this limitation, the authors inject explicit spatial localization from object detection models such as YOLO into VLMs as structured, symbolic prompts. This approach improves the models' integration of spatial and semantic information, revealing that counting failures stem primarily from insufficient use of spatial cues in current architectures and underscoring the need for compatibility between an enhancement strategy and the model's design. Experimental results show substantial improvements: counting accuracy on Ovis2.5-2B reaches 81.3%, a 6.6-percentage-point gain with 22% faster inference, and four of the five mainstream VLMs evaluated achieve consistent gains of 6.2–7.5 percentage points.
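The prompt injection described above can be pictured roughly as follows. This is a minimal sketch, assuming detections (class label, bounding box, confidence) have already been obtained from an ODM such as YOLO; the prompt template, the `build_grounded_prompt` helper, and its flags are illustrative assumptions, not the paper's actual implementation. The two toggles mirror the paper's ablations: positional encoding on by default, confidence scores off by default.

```python
# Sketch: turn ODM detections into a structured, symbolic grounding prompt
# that is prepended to the user's counting question before it reaches the VLM.
# The detection format and prompt template here are illustrative assumptions.

def build_grounded_prompt(detections, question,
                          include_positions=True, include_confidence=False):
    """detections: list of (label, (x1, y1, x2, y2), confidence) tuples."""
    lines = ["Detected objects:"]
    for i, (label, box, conf) in enumerate(detections, start=1):
        parts = [f"{i}. {label}"]
        if include_positions:      # positional cues: reported helpful for stronger VLMs
            parts.append(f"at bbox {box}")
        if include_confidence:     # confidence scores: reported to add noise for most models
            parts.append(f"(conf {conf:.2f})")
        lines.append(" ".join(parts))
    lines.append(question)
    return "\n".join(lines)

# Example with mocked YOLO-style detections
dets = [("apple", (12, 30, 80, 95), 0.91),
        ("apple", (110, 28, 178, 96), 0.88),
        ("banana", (40, 120, 150, 180), 0.76)]
prompt = build_grounded_prompt(dets, "How many apples are in the image?")
print(prompt)
```

The VLM then answers the counting question conditioned on this symbolic grounding rather than on its own (hallucination-prone) visual enumeration.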

πŸ“ Abstract
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6pp improvement, while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2–7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
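The abstract's claim that CNN-based ODMs handle instance counting with minimal overhead reflects how little post-processing a detector's output needs. A hedged sketch, assuming YOLO-style `(label, box, confidence)` detections and an illustrative confidence threshold (the `count_instances` helper and threshold value are assumptions, not from the paper):

```python
from collections import Counter

def count_instances(detections, min_conf=0.5):
    """Count detected instances per class label above a confidence threshold.
    detections: list of (label, (x1, y1, x2, y2), confidence) tuples, as one
    might parse from a YOLO-style detector's output."""
    return Counter(label for label, _box, conf in detections if conf >= min_conf)

dets = [("apple", (12, 30, 80, 95), 0.91),
        ("apple", (110, 28, 178, 96), 0.88),
        ("banana", (40, 120, 150, 180), 0.76),
        ("apple", (5, 5, 20, 20), 0.31)]   # low-confidence detection, filtered out
counts = count_instances(dets)
print(counts["apple"])   # 2
```

In GroundCount these per-class counts are not the final answer; they are carried into the VLM via the structured prompt so the model can still reason over the question semantically.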
Problem

Research questions and friction points this paper is trying to address.

counting hallucinations
vision-language models
object detection
spatial grounding
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
object detection
counting hallucination
spatial grounding
prompt-based augmentation
Boyuan Chen
Tandon School of Engineering, New York University, NY, USA; eBRAIN Lab, Division of Engineering, New York University Abu Dhabi, UAE
Minghao Shao
Tandon School of Engineering, New York University, NY, USA; eBRAIN Lab, Division of Engineering, New York University Abu Dhabi, UAE
Siddharth Garg
Institute Associate Professor, New York University
AI/ML, Hardware Security, Privacy
Ramesh Karri
Tandon School of Engineering, New York University, NY, USA
Muhammad Shafique
Professor, ECE, New York University (AD-UAE, Tandon-USA), Director eBRAIN Lab
Embedded Machine Learning, Brain-Inspired Computing, Robust & Energy-Efficient System Design, Smart