SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track

📅 2026-03-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses inaccurate localization in referring video object segmentation when expressions describe complex motion or refer to no object in the video (no-target queries). The method builds on SaSaSa2VA, which already strengthens the Sa2VA backbone with more input frames and [SEG] tokens, and adds a simple target existence-aware verification mechanism that checks whether the referred target is present. According to the authors, this improves the handling of motion-centric expressions and makes the model more robust to no-target queries. The approach placed 2nd in the MeViS-Text track of the 5th PVUW Challenge with a final score of 89.19.
📝 Abstract
Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring and reasoning motion expressions) and introducing no-target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.
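The abstract does not spell out how the verification step works, so the following is only a rough sketch of one way an existence check could gate segmentation output for no-target queries. It assumes the check reduces to a yes/no decision and that a negative decision should yield empty masks. The names `parse_existence_answer` and `gate_masks_by_existence`, the yes/no answer format, and the list-based mask layout are all hypothetical and are not taken from the paper.

```python
from typing import List

# One binary mask per frame, stored as rows of 0/1 values (assumed layout).
Mask = List[List[int]]


def parse_existence_answer(answer: str) -> bool:
    """Hypothetical helper: treat an answer starting with 'yes' as 'target exists'."""
    return answer.strip().lower().startswith("yes")


def gate_masks_by_existence(masks: List[Mask], target_exists: bool) -> List[Mask]:
    """Keep the masks if the target exists, otherwise return all-zero masks of the same shape."""
    if target_exists:
        return masks
    return [[[0 for _ in row] for row in frame] for frame in masks]


# Two 2x2 frames standing in for per-frame segmentation output.
predicted = [[[0, 1], [1, 1]], [[1, 0], [0, 0]]]
kept = gate_masks_by_existence(predicted, parse_existence_answer("Yes, it is visible."))
emptied = gate_masks_by_existence(predicted, parse_existence_answer("No."))
```

Under these assumptions, `kept` is the unchanged prediction, while `emptied` has the same shape with every pixel set to 0, which is the expected output form for a query whose target does not appear in the video.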
Problem

Research questions and friction points this paper is trying to address.

referring video object segmentation
motion-centric expressions
no-target queries
video grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

target existence-aware verification
referring video object segmentation
motion-centric expressions
no-target queries
SaSaSa2VA