Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track

📅 2025-09-18

🤖 AI Summary

This paper addresses key challenges in referring video object segmentation (RVOS): insufficient vision-language alignment, weak long-term temporal modeling, and missed early-appearing objects. It proposes a training-free framework built on Sa2VA, which couples a large language model (LLM) with SAM 2 for language-guided video segmentation. The method adds two components: (1) a Video-Language Checker that verifies whether the subject and action in the query actually appear in the video, filtering out false positives; and (2) a Key-Frame Sampler that adaptively selects frames to capture both early object appearances and long-range temporal context. On the MeViS test set, the approach achieves a J&F score of 64.14%, ranking second in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

📝 Abstract

Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description, bridging the gap between vision and language understanding. Recent work, such as Sa2VA, combines Large Language Models (LLMs) with SAM 2, leveraging the strong video reasoning capability of LLMs to guide video segmentation. In this work, we present a training-free framework that substantially improves Sa2VA's performance on the RVOS task. Our method introduces two key components: (1) a Video-Language Checker that explicitly verifies whether the subject and action described in the query actually appear in the video, thereby reducing false positives; and (2) a Key-Frame Sampler that adaptively selects informative frames to better capture both early object appearances and long-range temporal context. Without any additional training, our approach achieves a J&F score of 64.14% on the MeViS test set, placing 2nd in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.
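The gating role of the Video-Language Checker can be illustrated with a minimal sketch. The abstract only states that the checker verifies whether the queried subject and action appear in the video; everything else here is an assumption made for illustration: the names `CheckResult` and `segment_with_check` are hypothetical, the checker and segmenter are passed in as plain callables rather than real model calls, and an empty prediction is represented as `None` per frame.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class CheckResult:
    """Hypothetical checker verdict: is the queried subject / action in the video?"""
    subject_present: bool
    action_present: bool


def segment_with_check(
    frames: Sequence[Any],
    query: str,
    checker: Callable[[Sequence[Any], str], CheckResult],
    segmenter: Callable[[Sequence[Any], str], List[Any]],
) -> List[Optional[Any]]:
    """Run the segmenter only when the checker confirms the query matches the video.

    Assumption: when either the subject or the action is judged absent,
    the pipeline emits an empty prediction (None) for every frame.
    """
    verdict = checker(frames, query)
    if not (verdict.subject_present and verdict.action_present):
        # Query is not grounded in the video: suppress a likely false positive.
        return [None] * len(frames)
    return segmenter(frames, query)
```

In this sketch the segmenter is never invoked for a rejected query, which is one simple way a verification step can cut false positives without retraining either model.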

Problem

Research questions and friction points this paper is trying to address.

Improving RVOS by reducing false positive segmentations

Enhancing video-language alignment without retraining models

Adaptively selecting key frames for temporal context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework enhances Sa2VA

Video-Language Checker reduces false positives

Key-Frame Sampler captures temporal context
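To make the key-frame idea concrete, here is a small sketch of a frame selector that covers both the start of the video and its full temporal span under a fixed frame budget. This is not the paper's sampler, whose selection rule is not described in this summary; it is a simple stand-in that keeps the first few frames and spreads the remaining budget evenly to the last frame. The function name `sample_key_frames` and the parameters `budget` and `num_early` are hypothetical.

```python
from typing import List


def sample_key_frames(num_frames: int, budget: int, num_early: int = 2) -> List[int]:
    """Pick up to `budget` frame indices covering early frames and the whole video.

    Assumed strategy (illustrative only): keep the first `num_early` frames,
    then space the remaining picks evenly so the last pick is the final frame.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    if num_frames <= budget:
        # Short video: every frame fits in the budget.
        return list(range(num_frames))

    early = list(range(min(num_early, budget)))
    remaining = budget - len(early)
    if remaining == 0:
        return early

    start = len(early)
    span = num_frames - 1 - start  # distance from the early block to the last frame
    # Integer arithmetic keeps the spacing deterministic; since span >= remaining,
    # the indices are strictly increasing and never collide with the early block.
    rest = [start + (i * span) // remaining for i in range(1, remaining + 1)]
    return early + rest
```

For example, `sample_key_frames(100, 6)` returns `[0, 1, 26, 50, 74, 99]`: two early frames plus four frames spread across the rest of the clip.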
