AI Summary
This work addresses the spatial ambiguity and referential imprecision commonly observed in video referring segmentation methods based on large vision-language models (LVLMs), which stem from a mismatch between the model's optimization objective (text generation) and the grounding task. To resolve this, the authors propose an input-level attention modulation mechanism that combines learnable soft prompts with reasoning-oriented chain-of-thought guidance to produce sharper, more focused attention maps. These maps are then converted into point prompts on keyframes to drive a frozen segmentation model. Notably, the approach fine-tunes neither the LVLM nor the segmentation network: only the soft prompts are learned, on Ref-YouTube-VOS alone, and the method still generalizes effectively to multiple benchmarks. Under the challenging zero-shot video segmentation setting, it achieves substantially improved spatial grounding accuracy, outperforming current state-of-the-art methods.
Abstract
Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
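To make the last two pipeline steps in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the attention map for a keyframe is a 2D array, picks point prompts by greedy peak selection with local suppression, and scores a candidate tracklet by the mean Pearson correlation between its masks and the attention maps. The function names, the `num_points` and `min_dist` parameters, and the choice of Pearson correlation are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def attention_to_points(attn, num_points=3, min_dist=2):
    """Greedily pick up to `num_points` peaks from a 2D attention map.

    Hypothetical helper: after each pick, a square neighbourhood of
    half-width `min_dist` is suppressed so the points spread out.
    Returns a list of (x, y) pixel coordinates.
    """
    work = np.asarray(attn, dtype=float).copy()
    points = []
    for _ in range(num_points):
        y, x = np.unravel_index(int(np.argmax(work)), work.shape)
        if not np.isfinite(work[y, x]):
            break  # the whole map has been suppressed
        points.append((int(x), int(y)))
        y0, x0 = max(0, y - min_dist), max(0, x - min_dist)
        work[y0:y + min_dist + 1, x0:x + min_dist + 1] = -np.inf
    return points


def tracklet_score(masks, attns):
    """Mean per-frame Pearson correlation between masks and attention maps.

    Assumption: this is one plausible form of "correlation-based scoring";
    frames with a constant mask or constant attention contribute 0.
    """
    scores = []
    for mask, attn in zip(masks, attns):
        m = np.asarray(mask, dtype=float).ravel()
        a = np.asarray(attn, dtype=float).ravel()
        if m.std() == 0 or a.std() == 0:
            scores.append(0.0)
        else:
            scores.append(float(np.corrcoef(m, a)[0, 1]))
    return float(np.mean(scores)) if scores else 0.0


# Toy keyframe: two attention peaks on an 8x8 grid.
attn = np.zeros((8, 8))
attn[2, 5] = 1.0
attn[6, 1] = 0.8
points = attention_to_points(attn, num_points=2, min_dist=1)

# Two candidate masks: one covering the strongest peak, one elsewhere.
mask_on_peak = np.zeros((8, 8))
mask_on_peak[1:4, 4:7] = 1
mask_off_peak = np.zeros((8, 8))
mask_off_peak[0:2, 0:2] = 1
best = max([mask_on_peak, mask_off_peak],
           key=lambda m: tracklet_score([m], [attn]))
```

In a real pipeline the selected points would be passed as prompts to the frozen segmentation model on each keyframe, and the highest-scoring tracklet would be kept as the final prediction.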