The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

πŸ“… 2025-09-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Referring Video Object Segmentation (RVOS) requires jointly modeling linguistic semantics, visual appearance, and motion dynamics in videos; however, existing approaches suffer from sparse frame sampling and from using a single [SEG] token to represent an entire video's temporal structure. This work proposes a multimodal collaborative framework that integrates a multimodal large language model (MLLM) with the video segmentation foundation model SAM2. We design a segmentation-enhancement module to improve per-frame localization accuracy and introduce selective averaging fusion alongside test-time ensembling to explicitly model inter-frame consistency. Our approach overcomes the limitations of sparse sampling and single-token temporal modeling, significantly enhancing robustness in language-guided temporal localization and object tracking. Evaluated on the 7th LSVOS Challenge RVOS track, our method achieves a J&F score of 67.45, ranking first and outperforming the second-place method by 2.80 points.
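The test-time ensembling mentioned above can be illustrated with a minimal sketch: average per-pixel foreground probabilities from several inference runs (e.g., different frame samplings) and binarize the result. This is an illustrative assumption about the ensembling scheme, not the paper's actual implementation; the function name and threshold are hypothetical.

```python
import numpy as np

def ensemble_masks(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities from several
    inference runs, then binarize into a single mask (a common
    test-time ensembling scheme; details here are assumed)."""
    stacked = np.stack(prob_maps, axis=0)   # (runs, H, W)
    mean_prob = stacked.mean(axis=0)        # (H, W)
    return (mean_prob >= threshold).astype(np.uint8)

# Three hypothetical runs on a toy 2x2 frame
runs = [
    np.array([[0.9, 0.2], [0.6, 0.1]]),
    np.array([[0.8, 0.4], [0.7, 0.2]]),
    np.array([[0.7, 0.3], [0.5, 0.0]]),
]
mask = ensemble_masks(runs)  # pixels where the mean probability >= 0.5
```

Averaging probabilities before thresholding suppresses pixels that only one run predicts, which is one way inter-frame (and inter-run) consistency can be enforced at test time.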

πŸ“ Abstract
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a J&F of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and our ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/magic-research/Sa2VA.
Problem

Research questions and friction points this paper is trying to address.

Improving referring video object segmentation despite sparse frame sampling
Addressing the limitation of a single [SEG] token representing an entire video
Enhancing multimodal large language models for better video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmentation augmentation to compensate for sparse frame sampling
Selective averaging fusion for multi-[SEG]-token video representation
Test-time ensembling to improve grounded MLLM performance
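The selective-averaging idea in the list above can be sketched as follows: given several per-frame [SEG] token embeddings, keep only the ones that agree most with the overall consensus and average those into one fused prompt embedding. This is a hedged toy illustration of the concept; the agreement score (cosine similarity to the mean), the `keep_ratio` parameter, and the function name are all assumptions, not the paper's actual algorithm.

```python
import numpy as np

def selective_average(seg_tokens, keep_ratio=0.5):
    """Fuse several [SEG] token embeddings by averaging only the
    most mutually consistent ones (a hypothetical sketch of
    selective averaging fusion)."""
    tokens = np.stack(seg_tokens)                 # (T, D)
    mean = tokens.mean(axis=0)
    # Cosine similarity of each token to the mean embedding,
    # used here as a simple agreement score.
    sims = (tokens @ mean) / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(mean) + 1e-8
    )
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.argsort(sims)[-k:]                  # most consistent tokens
    return tokens[keep].mean(axis=0)

# Two consistent token embeddings plus one outlier (toy 2-D example)
tokens = [np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([-1.0, 0.0])]
fused = selective_average(tokens)  # outlier is excluded before averaging
```

Compared with naively averaging all tokens, dropping the least consistent ones keeps a single unreliable frame from dragging the fused prompt away from the referred object.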