AI Summary
This study addresses the susceptibility of vision-language models (VLMs) to hallucination in counting tasks, where their performance lags significantly behind other visual reasoning capabilities. To mitigate this limitation, the authors propose injecting structured prompts into VLMs in a symbolic manner by leveraging explicit spatial localization from object detection models such as YOLO. This approach enhances the model's ability to integrate spatial and semantic information, revealing that counting failures primarily stem from insufficient utilization of spatial cues in current architectures and underscoring the necessity of compatibility between enhancement strategies and model design. Experimental results demonstrate substantial improvements: counting accuracy on Ovis2.5-2B reaches 81.3%, a 6.6-percentage-point gain with 22% faster inference, and four out of five mainstream VLMs achieve consistent gains of 6.2 to 7.5 percentage points.
Abstract
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6pp improvement, while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2 to 7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
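To make the prompt-based augmentation idea concrete, the following is a minimal sketch of how detector output might be turned into a structured text prompt for a VLM. It is not the GroundCount implementation: the abstract does not specify the prompt format, so the `Detection` type, the 3x3 position grid, the function names, and the prompt wording are all assumptions made for illustration. The two flags mirror the ablations described above (positional information and confidence scores), with confidence off by default because the abstract reports that removing it helped four of five models.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """One detector output. Field names are hypothetical, not from the paper."""
    label: str
    confidence: float
    # Bounding box in pixels: (x_min, y_min, x_max, y_max).
    box: tuple


def coarse_position(box, width, height):
    """Map a box centre to a 3x3 grid cell such as 'top-left' (assumed encoding)."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = ["left", "center", "right"][min(int(3 * cx / width), 2)]
    row = ["top", "middle", "bottom"][min(int(3 * cy / height), 2)]
    return f"{row}-{col}"


def build_grounded_prompt(question, detections, width, height,
                          with_positions=True, with_confidence=False):
    """Prepend per-class counts (and optional per-instance details) to a question."""
    counts = Counter(d.label for d in detections)
    lines = ["Detected objects (from an external object detector):"]
    for label in sorted(counts):
        lines.append(f"- {label}: {counts[label]}")
        if not (with_positions or with_confidence):
            continue
        # One sub-bullet per instance of this class.
        for d in detections:
            if d.label != label:
                continue
            parts = []
            if with_positions:
                parts.append(coarse_position(d.box, width, height))
            if with_confidence:
                parts.append(f"confidence {d.confidence:.2f}")
            lines.append("  - " + ", ".join(parts))
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In use, the detections would come from an object detector such as YOLO run on the same image, and the returned string would be passed to the VLM alongside the image in place of the bare question. Toggling `with_positions` and `with_confidence` gives a simple way to reproduce the kind of ablation the abstract describes on a given model.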