🤖 AI Summary
In visual pointing, existing vision-language models (VLMs) suffer from limited accuracy because they must localize a target in a single inference step, falling short of human capabilities in iterative observation and refinement. This paper introduces Poivre, a self-refining framework that integrates process-based reward modeling into visual pointing. It establishes an iterative "point–visualize–refine" pipeline, in which the model marks its estimated coordinates on the image, inspects the result, and refines the prediction over multiple steps, with the refinement policy optimized end to end via reinforcement learning. Key contributions include a human-inspired self-refinement mechanism and a process reward designed for the dynamics of iterative localization. Experiments show that Poivre-7B sets a new state of the art on Point-Bench, outperforming both the proprietary Gemini-2.5-Pro and the open-source Molmo-72B by over 3%. The code, data, and models are publicly released.
📝 Abstract
Visual pointing, which aims to localize a target by predicting its coordinates on an image, has emerged as an important problem in the realm of vision-language models (VLMs). Despite its broad applicability, recent benchmarks show that current VLMs still fall far behind human performance on this task. A key limitation is that VLMs are typically required to complete the pointing task in a single step, akin to asking humans to point at an object without seeing their own fingers. To address this issue, we propose a simple yet effective self-refining procedure: Point, Visualize, then Refine (Poivre). This procedure enables a VLM to first mark its estimated point, then iteratively refine the coordinates if necessary. Inspired by advances in reasoning models in the natural language domain, we employ reinforcement learning (RL) to incentivize this self-refining ability. For the RL training, we design a neat process reward that is not only empirically effective but also grounded in appealing properties. Our trained model, Poivre-7B, sets a new state of the art on Point-Bench, outperforming both proprietary models such as Gemini-2.5-Pro and large open-source models such as Molmo-72B by over 3%. To support future research, we release our training and inference code, dataset, and the Poivre-7B checkpoint.
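To make the point–visualize–refine loop concrete, below is a minimal sketch of the inference procedure and one plausible per-step reward. Everything here is an assumption for illustration: `model_step` stands in for a VLM call that sees the previous point rendered on the image, the stopping rule (stop when the estimate stabilizes) and the distance-improvement reward are illustrative choices, not the paper's exact formulation, which the abstract does not specify.

```python
import math


def point_visualize_refine(model_step, image, query, max_steps=5, tol=1.0):
    """Sketch of the Point-Visualize-Refine loop.

    model_step(image, query, prev_point) -> (x, y) is a hypothetical
    wrapper around the VLM; the "visualize" step is assumed to happen
    inside it (the previous point is drawn on the image before the
    model predicts again). Stops early once the estimate stabilizes.
    """
    point = None
    trajectory = []  # all intermediate predictions, for reward computation
    for _ in range(max_steps):
        new_point = model_step(image, query, point)
        trajectory.append(new_point)
        converged = point is not None and math.dist(point, new_point) < tol
        point = new_point
        if converged:
            break
    return point, trajectory


def process_reward(trajectory, target):
    """Illustrative process reward: each refinement step is rewarded by
    how much it reduces the distance to the ground-truth target, so the
    model is credited for every step that moves the point closer.
    (An assumption; the paper's actual reward may differ.)
    """
    dists = [math.dist(p, target) for p in trajectory]
    return [dists[i] - dists[i + 1] for i in range(len(dists) - 1)]
```

A stubbed model that halves its error each step would yield a strictly positive reward at every refinement step under this scheme, which is the behavior the RL training is meant to incentivize.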