3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

📅 2025-09-23
🤖 AI Summary
Sa2VA, a widely used model for language-guided dense grounding in images and video, underperforms on referring video object segmentation because its training and inference procedures are inconsistent. This paper proposes Sa2VA-i, which rectifies these inconsistencies by aligning inference with the training setup while reusing the original Sa2VA checkpoints, with no retraining or architectural changes. With these fixes, Sa2VA-i sets new state-of-the-art results on four video benchmarks, gaining up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS, and +4.1 on ReVOS. The lightweight Sa2VA-i-1B even performs on par with the original Sa2VA-26B on MeViS, underscoring how seemingly trivial implementation details can substantially affect referring video segmentation results.

📝 Abstract
Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i
Problem

Research questions and friction points this paper is trying to address.

Identifies inconsistencies between training and inference procedures in the Sa2VA model as the key factor limiting its performance
Improves referring video object segmentation results without retraining
Raises language-guided dense grounding accuracy across multiple video benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns inference procedures with the training setup for consistency
Reuses the original Sa2VA checkpoints, requiring no retraining or architectural changes
Achieves gains of up to +11.6 J&F on MeViS; the 1B variant matches the original 26B model on that benchmark
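This listing does not spell out which implementation details were inconsistent. A common mismatch of this kind in video segmentation is frame sampling: a model trained on a fixed number of uniformly sampled frames is run at inference with a different frame-selection scheme. The sketch below illustrates the general idea of sharing one sampling routine between training and inference; the function and the `model.segment` / `model.propagate` interface are hypothetical and may differ from the paper's actual fixes.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Uniformly sample `num_samples` frame indices from a clip of `num_frames`.

    Calling this same routine during both training and inference keeps the
    model's temporal input distribution consistent between the two phases.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Center each sample inside its uniform bin.
    return [int(step * i + step / 2) for i in range(num_samples)]


def segment_video(model, frames, num_samples=5):
    """Segment sampled key frames, then propagate masks to every frame.

    `model` is a hypothetical object exposing `segment` (masks on key frames)
    and `propagate` (mask propagation to the full clip).
    """
    idx = sample_frame_indices(len(frames), num_samples)
    key_masks = model.segment([frames[i] for i in idx])
    return model.propagate(frames, idx, key_masks)
```

The point of the sketch is not the specific sampler, but that training and inference call the identical code path, so any change to one is automatically reflected in the other.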