AI Summary
Existing vision-language models ground referring expressions by generating coordinates as text, which requires modeling a complex coordinate system and incurs substantial token overhead, limiting both efficiency and accuracy. This work proposes a more intuitive referring mechanism based on hierarchical visual token selection: a dedicated pointing token progressively localizes the target patch, sub-patch, and precise location through sequential generation, aided by relative positional encoding and a "no more points" termination class. The approach sets a new state of the art of 70.7% on PointBench, scores 61.1% on ScreenSpotPro (best among fully open models), attains a 59.1% win rate in human preference evaluations for video pointing, improves tracking on Molmo2Track by 6.3%, and markedly improves sample efficiency.
Abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
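The hierarchical selection described above (patch, then subpatch, then a location within the subpatch) can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's implementation: the grid sizes (`grid`, `sub_grid`, `loc_grid`, `image_size`) and the dot-product selection head are assumptions chosen for clarity, and the "no more points" option is modeled as one extra candidate key appended to the visual tokens.

```python
import numpy as np

def select_token(query, keys, stop_key=None):
    """Pick the visual token whose key best matches the pointing-token query.

    `query` is the hidden state of the special pointing token; `keys` are the
    visual-token embeddings it cross-attends to. If `stop_key` (the assumed
    "no more points" class embedding) is given, it is appended as the last
    candidate, so returning len(keys) means "stop generating points".
    """
    cand = keys if stop_key is None else np.vstack([keys, stop_key])
    scores = cand @ query          # dot-product attention scores
    return int(np.argmax(scores))  # hard selection of one candidate

def hierarchical_point(patch_idx, sub_idx, loc_idx,
                       grid=24, sub_grid=4, loc_grid=4, image_size=768):
    """Resolve three selected indices into pixel coordinates.

    patch_idx indexes a grid x grid layout of image patches, sub_idx a
    sub_grid x sub_grid layout inside that patch, and loc_idx a
    loc_grid x loc_grid layout inside that subpatch. The returned point is
    the center of the final cell. All grid sizes are illustrative.
    """
    patch_px = image_size / grid   # e.g. 32 px per patch
    sub_px = patch_px / sub_grid   # 8 px per subpatch
    loc_px = sub_px / loc_grid     # 2 px per location cell
    py, px = divmod(patch_idx, grid)
    sy, sx = divmod(sub_idx, sub_grid)
    ly, lx = divmod(loc_idx, loc_grid)
    x = px * patch_px + sx * sub_px + (lx + 0.5) * loc_px
    y = py * patch_px + sy * sub_px + (ly + 0.5) * loc_px
    return x, y
```

Note how the three-level refinement reaches roughly 2-pixel precision while each step only chooses among a small, fixed set of classes, rather than emitting multi-digit coordinate strings as text tokens.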