🤖 AI Summary
This work addresses inaccurate localization in video referring expression segmentation, particularly when motion descriptions are complex and no explicit target is named in the query. Building on the Sa2VA backbone, the proposed method extends the input frame sequence and incorporates [SEG] tokens, and introduces a target existence-aware verification mechanism: a simple yet effective check for target presence that strengthens the model's understanding of motion-centric semantics and its robustness to queries without an explicit target. The approach achieved second place in the MeViS-Text track of the 5th PVUW Challenge with a score of 89.19, demonstrating strong segmentation performance under complex referring expressions.
📝 Abstract
Referring video object segmentation (RVOS) typically grounds targets in videos based on static textual cues. The MeViS benchmark extends this setting with motion-centric expressions (both referring and reasoning motion expressions) and no-target queries. Extending SaSaSa2VA, in which increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, yielding Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy suffices to unlock strong performance on motion-centric referring tasks.