🤖 AI Summary
This work addresses inaccurate localization in video referring expression segmentation, particularly when motion descriptions are complex and no explicit target is named in the query. Building on the Sa2VA backbone, the proposed method extends the input frame sequence and incorporates [SEG] tokens, and introduces a target existence-aware verification mechanism: a simple yet effective check for target presence that strengthens the model's understanding of motion-centric semantics and its robustness to queries without an explicit target. The approach achieved second place in the MeViS-Text track of the 5th PVUW Challenge with a score of 89.19, demonstrating strong segmentation performance under complex referring expressions.
📝 Abstract
Referring video object segmentation (RVOS) typically grounds targets in videos based on static textual cues. The MeViS benchmark extends this setting with motion-centric expressions (both referring and reasoning motion expressions) and no-target queries. Extending SaSaSa2VA, in which increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, yielding Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy suffices to unlock strong performance on motion-centric referring tasks.