Getting to the Point: Why Pointing Improves LVLMs

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how referential mechanisms enhance the accuracy and interpretability of large vision-language models (LVLMs) on cognitive tasks such as zero-shot counting. To this end, the authors propose a β€œPoint-then-Count” strategy that decouples visual grounding and reasoning into an explicit two-stage process: first predicting the spatial coordinates of target objects, then generating answers conditioned on these coordinates. Experimental results demonstrate that this approach substantially improves out-of-distribution generalization, with over 89% of predicted points accurately grounded in the image. Furthermore, coordinate-based encoding effectively mitigates task overfitting. This work presents the first systematic evaluation of the spatial reliability and bias of predicted points in LVLMs, revealing that modeling spatial relationships is a key mechanism through which referential cues boost performance.
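The two-stage output described above can be sketched as a simple response parser. As a minimal illustration, assume the model emits its points as `(x, y)` pairs followed by a final `Count: N`; this output template and the function name are assumptions for the sketch, not the paper's exact format. A caller can then recover both the coordinates and the answer, and check that they agree:

```python
import re

def parse_point_then_count(response: str):
    """Parse a hypothetical Point-then-Count response of the form
    'Points: (x1, y1), (x2, y2), ... Count: N' into a list of
    coordinates and the final count (None if no count is present)."""
    points = [(int(x), int(y))
              for x, y in re.findall(r"\((\d+),\s*(\d+)\)", response)]
    m = re.search(r"Count:\s*(\d+)", response)
    count = int(m.group(1)) if m else None
    return points, count

# Illustrative model output (hypothetical):
response = "Points: (120, 45), (300, 210), (88, 330) Count: 3"
points, count = parse_point_then_count(response)
# A cheap consistency check: the stated count should match the grounded points.
assert count == len(points)
```

Decoupling the answer from the intermediate points in this way is what makes the points usable as visual explanations: a mismatch between `count` and `len(points)` flags an inconsistent generation.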

πŸ“ Abstract
Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanisms support these gains and how relevant they are to cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfit on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
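The abstract reports point reliability as an F1 score. One plausible way to operationalize this is a greedy one-to-one matching in which a predicted point counts as a true positive if it falls inside a not-yet-matched ground-truth region; note that this matching protocol, and the use of boxes rather than segmentation masks, are assumptions for the sketch and may differ from the paper's exact criterion:

```python
def point_grounding_f1(pred_points, gt_boxes):
    """Greedy one-to-one matching: a predicted (x, y) point is a true
    positive if it lies inside an unmatched ground-truth box
    (x0, y0, x1, y1). Returns (precision, recall, f1).
    The matching scheme is an assumption, not the paper's protocol."""
    matched = set()
    tp = 0
    for (x, y) in pred_points:
        for i, (x0, y0, x1, y1) in enumerate(gt_boxes):
            if i not in matched and x0 <= x <= x1 and y0 <= y <= y1:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred_points) if pred_points else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Because each ground-truth box can absorb at most one point, a model that scatters many points on a single object is penalized in precision, while missed objects lower recall; the spatial-bias analysis in the paper amounts to computing such a score per image region.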
Problem

Research questions and friction points this paper is trying to address.

pointing
Large Vision-Language Models
zero-shot counting
visual grounding
spatial bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

pointing
Large Vision-Language Models
zero-shot counting
spatial grounding
out-of-distribution generalization
Simone Alghisi
Signals and Interactive Systems Lab, University of Trento, Povo TN 38123, IT
Massimo Rizzoli
Signals and Interactive Systems Lab, University of Trento, Povo TN 38123, IT
Seyed Mahed Mousavi
Signals and Interactive Systems Lab, University of Trento, Povo TN 38123, IT
Giuseppe Riccardi
Professor of Computer Science, University of Trento, Italy
Natural Language Processing · Speech Processing · Dialogue · Machine Learning