4th PVUW MeViS 3rd Place Report: Sa2VA

📅 2025-04-01

🤖 AI Summary

This work addresses referring video object segmentation (RVOS) on MeViS, a dataset whose motion-centric expressions make it harder to align language with the target object. The approach builds on Sa2VA, a unified MLLM-based model for dense grounded understanding of images and videos, and changes only the test-time inference procedure: the scope of key frames is enlarged, with no parameter fine-tuning. This is intended to improve temporal grounding of motion-related expressions and, in turn, segmentation accuracy. The method yields stronger results on MeViS and ranked third in the 4th PVUW MeViS Challenge, which suggests that inference-time key-frame expansion without parameter updates is an effective and practical option for motion-aware RVOS.

📝 Abstract

Referring video object segmentation (RVOS) is a challenging task that requires a model to segment an object in a video given a language description. MeViS is a recently proposed dataset whose expressions describe the motion of the target objects, making it a more challenging benchmark than existing RVOS benchmarks. Meanwhile, a new trend for referring expression tasks is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
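The abstract does not spell out how the key-frame scope is enlarged, so the following is only a minimal sketch of one plausible reading: instead of taking key frames from the start of the clip, spread the same number of key frames across the whole video. Both function names, the baseline "first k frames" rule, and the uniform spacing rule are assumptions made for illustration; they are not taken from the report or from the Sa2VA code.

```python
from typing import List


def first_k_keyframes(num_frames: int, k: int) -> List[int]:
    """Hypothetical baseline: use the first k frames as key frames."""
    return list(range(max(0, min(k, num_frames))))


def uniform_keyframes(num_frames: int, k: int) -> List[int]:
    """Hypothetical enlarged scope: spread k key frames over the full video."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot pick more key frames than frames
    if k == 1:
        return [0]
    # Evenly spaced indices from the first frame to the last, inclusive.
    # Integer arithmetic keeps the result deterministic.
    return [(i * (num_frames - 1)) // (k - 1) for i in range(k)]


if __name__ == "__main__":
    print(first_k_keyframes(100, 5))  # [0, 1, 2, 3, 4]
    print(uniform_keyframes(100, 5))  # [0, 24, 49, 74, 99]
```

Under this reading, a motion expression that only becomes true late in the clip has a chance of being visible in at least one key frame, and no model weights are touched.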

Problem

Research questions and friction points this paper is trying to address.

Segmenting video objects using language descriptions

Improving RVOS with multi-modal large language models

Enhancing key frame scope for better MeViS performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modified test time inference on MLLMs

Adopted Sa2VA for dense grounded understanding

Enlarged key frame scope without training
