🤖 AI Summary
Referring Video Object Segmentation (RVOS) requires joint modeling of linguistic semantics, visual appearance, and motion dynamics in videos; however, existing approaches suffer from sparse frame sampling and from using a single [SEG] token to represent the entire video's temporal structure. This work proposes a multimodal collaborative framework that integrates a multimodal large language model (MLLM) with the video segmentation foundation model SAM2. We design a segmentation-enhancement module to improve per-frame localization accuracy and introduce selective averaging fusion alongside test-time ensembling to explicitly model inter-frame consistency. Our approach overcomes the limitations of sparse sampling and single-token temporal modeling, significantly improving robustness in language-guided temporal localization and object tracking. Evaluated on the RVOS track of the 7th LSVOS Challenge, our method achieves a J&F score of 67.45, ranking first and outperforming the second-place method by 2.80 points.
📝 Abstract
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $\mathcal{J}\&\mathcal{F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/magic-research/Sa2VA.
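To make the "selective averaging" idea concrete, here is a minimal sketch of how selected per-frame mask predictions can be fused at test time. This is an illustrative reconstruction, not the paper's implementation: the function name `selective_average_fusion`, the confidence-based top-k selection rule, and the 0.5 binarization threshold are all assumptions made for the example.

```python
import numpy as np

def selective_average_fusion(masks, scores, top_k=2, threshold=0.5):
    """Fuse several mask probability maps for one frame.

    masks:  (n_preds, H, W) array of per-pixel probabilities in [0, 1],
            e.g. one map per sampled [SEG] token or per ensemble run.
    scores: (n_preds,) confidence score for each prediction.

    Only the top_k most confident maps are kept (the "selective" part),
    their probabilities are averaged, and the result is binarized.
    """
    order = np.argsort(scores)[::-1][:top_k]   # indices of top-k predictions
    fused = masks[order].mean(axis=0)          # average selected probability maps
    return (fused >= threshold).astype(np.uint8)

# Toy example: three 2x2 predictions; the third is a low-confidence outlier
# that selection discards before averaging.
preds = np.array([
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.1, 0.9], [0.2, 0.8]],
])
conf = np.array([0.95, 0.90, 0.20])
print(selective_average_fusion(preds, conf))  # → [[1 0] [1 0]]
```

Averaging only the selected, high-confidence maps keeps one noisy prediction from corrupting the fused mask, which is the intuition behind combining selection with test-time ensembling.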