Getting to the Point: Why Pointing Improves LVLMs

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how referential mechanisms enhance the accuracy and interpretability of large vision-language models (LVLMs) on cognitive tasks such as zero-shot counting. To this end, the authors propose a β€œPoint-then-Count” strategy that decouples visual grounding and reasoning into an explicit two-stage process: first predicting the spatial coordinates of target objects, then generating answers conditioned on these coordinates. Experimental results demonstrate that this approach substantially improves out-of-distribution generalization, with over 89% of predicted points accurately grounded in the image. Furthermore, coordinate-based encoding effectively mitigates task overfitting. This work presents the first systematic evaluation of the spatial reliability and bias of predicted points in LVLMs, revealing that modeling spatial relationships is a key mechanism through which referential cues boost performance.
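The two-stage output described above can be sketched as a simple response parser. As a minimal illustration, assume the model emits its points as `(x, y)` pairs followed by a final `Count: N`; this output template and the function name are assumptions for the sketch, not the paper's exact format. A caller can then recover both the coordinates and the answer, and check that they agree:

```python
import re

def parse_point_then_count(response: str):
    """Parse a hypothetical Point-then-Count response of the form
    'Points: (x1, y1), (x2, y2), ... Count: N' into a list of
    coordinates and the final count (None if no count is present)."""
    points = [(int(x), int(y))
              for x, y in re.findall(r"\((\d+),\s*(\d+)\)", response)]
    m = re.search(r"Count:\s*(\d+)", response)
    count = int(m.group(1)) if m else None
    return points, count

# Illustrative model output (hypothetical):
response = "Points: (120, 45), (300, 210), (88, 330) Count: 3"
points, count = parse_point_then_count(response)
# A cheap consistency check: the stated count should match the grounded points.
assert count == len(points)
```

Decoupling the answer from the intermediate points in this way is what makes the points usable as visual explanations: a mismatch between `count` and `len(points)` flags an inconsistent generation.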

πŸ“ Abstract
Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanisms support these gains and how relevant they are to cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfit on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
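The abstract reports point reliability as an F1 score. One plausible way to operationalize this is a greedy one-to-one matching in which a predicted point counts as a true positive if it falls inside a not-yet-matched ground-truth region; note that this matching protocol, and the use of boxes rather than segmentation masks, are assumptions for the sketch and may differ from the paper's exact criterion:

```python
def point_grounding_f1(pred_points, gt_boxes):
    """Greedy one-to-one matching: a predicted (x, y) point is a true
    positive if it lies inside an unmatched ground-truth box
    (x0, y0, x1, y1). Returns (precision, recall, f1).
    The matching scheme is an assumption, not the paper's protocol."""
    matched = set()
    tp = 0
    for (x, y) in pred_points:
        for i, (x0, y0, x1, y1) in enumerate(gt_boxes):
            if i not in matched and x0 <= x <= x1 and y0 <= y <= y1:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred_points) if pred_points else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Because each ground-truth box can absorb at most one point, a model that scatters many points on a single object is penalized in precision, while missed objects lower recall; the spatial-bias analysis in the paper amounts to computing such a score per image region.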
Problem

Research questions and friction points this paper is trying to address.

pointing
Large Vision-Language Models
zero-shot counting
visual grounding
spatial bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

pointing
Large Vision-Language Models
zero-shot counting
spatial grounding
out-of-distribution generalization
Simone Alghisi
Signals and Interactive Systems Lab, University of Trento, Povo TN 38123, IT
Massimo Rizzoli
Signals and Interactive Systems Lab, University of Trento, Povo TN 38123, IT
Seyed Mahed Mousavi
Signals and Interactive Systems Lab, University of Trento, Povo TN 38123, IT
Giuseppe Riccardi
Professor of Computer Science, University of Trento, Italy
Natural Language Processing · Speech Processing · Dialogue · Machine Learning