🤖 AI Summary
This work addresses the hallucination problem in vision-language models for spatiotemporal video grounding, which arises from misalignment between textual and visual coordinates. To sidestep the need for explicit cross-modal coordinate alignment, the authors propose a novel paradigm that reformulates frame-wise coordinate prediction as instance-level ID recognition. They introduce temporally consistent visual prompts and, for the first time, establish a reinforcement learning framework tailored to this task. By leveraging unique instance ID embeddings and a task-driven reward mechanism, the model jointly optimizes temporal accuracy, spatial consistency, and output structure. The proposed method achieves a 20.9% absolute mIoU improvement over the Qwen2.5-VL-7B baseline on HCSTVG-v2, setting a new state of the art, and further establishes a new SOTA on the MeViS zero-shot multi-object referring expression segmentation benchmark with a 47.3% J&F score.
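The core reformulation is that each object carries a unique, temporally consistent ID across frames, so the model names an instance rather than regressing coordinates. The paper does not publish its tracking code, but the idea of keeping IDs stable over time can be sketched with a simple greedy IoU-matching tracker; all function names here are hypothetical illustration, not the authors' implementation:

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign_ids(frames, iou_thresh=0.5):
    """Greedy IoU matching: each detection inherits the ID of its best
    unclaimed match in the previous frame, or receives a fresh ID.
    The resulting IDs are what would be rendered onto the video as
    visual prompts for the VLM."""
    next_id, prev, tracks = 0, [], []   # prev/tracks: lists of (id, box)
    for boxes in frames:
        cur, used = [], set()
        for box in boxes:
            best_id, best_iou = None, iou_thresh
            for tid, pbox in prev:
                if tid in used:
                    continue
                s = iou(box, pbox)
                if s >= best_iou:
                    best_id, best_iou = tid, s
            if best_id is None:
                best_id, next_id = next_id, next_id + 1
            used.add(best_id)
            cur.append((best_id, box))
        tracks.append(cur)
        prev = cur
    return tracks
```

With IDs fixed per instance, the grounding answer collapses to "which ID, over which frames", which is the compact output the method exploits.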
📝 Abstract
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable 20.9% m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation, achieving a SOTA 47.3% J&F on MeViS.
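The abstract describes a task-driven reward with three terms: temporal accuracy, spatial consistency (via the instance ID), and output-format regularization. A minimal sketch of such a composite reward is below; the answer template, weights, and regex are assumptions for illustration, not the paper's actual reward specification:

```python
import re

def temporal_iou(pred, gt):
    """IoU of predicted vs ground-truth (start, end) frame spans."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Hypothetical structured-answer template the policy is rewarded for emitting.
ANSWER_RE = re.compile(r"<answer>\s*id\s*=\s*(\d+)\s*,\s*span\s*=\s*(\d+)-(\d+)\s*</answer>")

def reward(completion, gt_id, gt_span, w_t=0.5, w_s=0.4, w_f=0.1):
    """Composite reward: temporal IoU + instance-ID match + format bonus.
    Weights w_t, w_s, w_f are illustrative placeholders."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0                            # malformed output earns nothing
    fmt = 1.0                                 # format term: output parses
    pid, s, e = int(m.group(1)), int(m.group(2)), int(m.group(3))
    spatial = 1.0 if pid == gt_id else 0.0    # correct instance ID picked
    temporal = temporal_iou((s, e), gt_span)  # temporal localization quality
    return w_t * temporal + w_s * spatial + w_f * fmt
```

In RL fine-tuning pipelines of this kind (e.g. GRPO-style training), such a scalar reward scores each sampled completion, so the policy is jointly pushed toward well-formed outputs, the right instance, and a tight temporal span.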