PinPoint: Prompting with Informative Interior Points

📅 2026-05-26

🤖 AI Summary

Training-free referring expression segmentation methods lag well behind fine-tuned and reinforcement learning (RL)-tuned approaches, mainly because their prompts are ambiguous: a bounding box alone does not tell SAM which pixels inside it belong to the referred object. This work proposes a training-free, deterministic interior point selection strategy that fuses multiple visual cues (saliency, edges, superpixels, and depth) into a consensus map. The method selects compact yet spatially diverse interior points that avoid boundary regions and distractors, uses them as prompts for SAM, and has a frozen vision-language model (VLM) label each point. With the same five-point budget, replacing naive sampling with this selection improves cIoU by 12–18 percentage points on RefCOCO/+/g. With only two VLM calls per query and no task-specific training, the approach matches supervised and RL-tuned specialists.
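For readers unfamiliar with the metric, cIoU (cumulative Intersection-over-Union) is conventionally computed by summing intersections and unions over the whole evaluation set before dividing, rather than averaging per-image IoU. The sketch below illustrates that convention; the function name `cumulative_iou` is hypothetical, and the paper's exact evaluation code is not shown in this listing.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """Total intersection divided by total union, accumulated over all samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        total_inter += int(np.logical_and(pred, gt).sum())
        total_union += int(np.logical_or(pred, gt).sum())
    # Guard against an evaluation set with no foreground pixels at all.
    return total_inter / total_union if total_union > 0 else 0.0
```

Because the sums are pooled, large objects contribute more to cIoU than small ones, which is why it can differ from the mean of per-image IoU values.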

📝 Abstract

Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.
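The selection step described in the abstract (fuse cues into a consensus map, then pick spatially diverse points away from boundaries) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each cue is already a per-pixel map where higher values mean "more likely object interior" (so an edge map would be passed inverted), fuses them by a plain average, and enforces diversity with greedy suppression. The names `select_interior_points`, `min_dist`, and `margin` are hypothetical, and the VLM labeling step is omitted.

```python
import numpy as np

def normalize(cue):
    """Rescale a cue map to [0, 1]; a constant map becomes all zeros."""
    cue = np.asarray(cue, dtype=np.float64)
    lo, hi = cue.min(), cue.max()
    return np.zeros_like(cue) if hi == lo else (cue - lo) / (hi - lo)

def select_interior_points(cues, bbox, k=5, min_dist=8, margin=2):
    """Deterministically pick up to k (x, y) points inside bbox.

    cues: list of HxW maps, higher = more interior-like (assumption).
    bbox: (x0, y0, x1, y1), with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = bbox
    # Assumed fusion rule: unweighted mean of normalized cues.
    consensus = np.mean([normalize(c) for c in cues], axis=0)

    # Only pixels inside the box, and at least `margin` from its border, are eligible.
    score = np.full(consensus.shape, -np.inf)
    ys = slice(y0 + margin, y1 - margin)
    xs = slice(x0 + margin, x1 - margin)
    score[ys, xs] = consensus[ys, xs]

    yy, xx = np.mgrid[:score.shape[0], :score.shape[1]]
    points = []
    for _ in range(k):
        y, x = np.unravel_index(int(np.argmax(score)), score.shape)
        if not np.isfinite(score[y, x]):
            break  # no eligible pixels left
        points.append((int(x), int(y)))
        # Suppress a disc around the chosen point to keep points spread out.
        score[(yy - y) ** 2 + (xx - x) ** 2 < min_dist ** 2] = -np.inf
    return points
```

The returned coordinates would then be passed to the segmenter as point prompts alongside the bounding box. Since `np.argmax` breaks ties by index order, the same inputs always yield the same points, matching the deterministic behavior the abstract emphasizes.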

Problem

Research questions and friction points this paper is trying to address.

referring image segmentation

prompt ambiguity

interior points

vision-language model

Segment Anything Model

Innovation

Methods, ideas, or system contributions that make the work stand out.

referring image segmentation

prompt engineering

training-free