ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025

Date: 2025-03-30
AI Summary
This paper addresses the challenge in referring video object segmentation (RVOS) of simultaneously achieving high segmentation accuracy and strong temporal consistency across both single- and multi-object scenarios. To this end, we propose a novel framework that synergistically integrates ReferDINO's cross-modal grounding capability with SAM2's high-fidelity mask generation. Our key contributions are: (1) the first incorporation of SAM2's temporal modeling capacity into RVOS; (2) a conditional dynamic mask fusion strategy that adaptively balances robust single-object localization and effective multi-object disentanglement; and (3) fine-tuning on the MeViS dataset to enhance text-video alignment fidelity. Evaluated on the MeViS test set, our method achieves a J&F score of 60.43, ranking second in the CVPR 2025 PVUW MeViS Challenge.
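For readers unfamiliar with the reported metric, J&F is conventionally the mean of region similarity J (mask intersection over union) and contour accuracy F (a boundary F-measure). The sketch below is only an illustration of that averaging: the function names are hypothetical, the empty-mask convention is an assumption, and the boundary measure F is taken as a precomputed input rather than implemented. It is not the challenge's official evaluation code.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j_scores, f_scores):
    """J&F: average of the mean J and the mean F over all evaluated masks."""
    return 0.5 * (float(np.mean(j_scores)) + float(np.mean(f_scores)))
```

Under this definition, a score such as 60.43 would be read as the average of the J and F means, each expressed as a percentage.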

Abstract
Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 \(\mathcal{J}\&\mathcal{F}\) on the MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.
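The abstract does not spell out the condition used by the mask fusion strategy, so the following is only a minimal sketch of what a conditional fusion of two per-frame binary masks could look like. The rule (fall back to the union when the SAM2 mask is much smaller than the ReferDINO mask), the function name `conditional_mask_fusion`, and the `area_ratio_threshold` default are all illustrative assumptions, not details confirmed by this summary; the linked repository is the place to check the actual rule.

```python
import numpy as np

def conditional_mask_fusion(mask_referdino, mask_sam2, area_ratio_threshold=2 / 3):
    """Fuse two binary masks for one frame (hypothetical rule).

    Assumption: if the SAM2 mask covers much less area than the ReferDINO
    mask, SAM2 may have dropped some of the referred objects, so the union
    of both masks is returned. Otherwise the SAM2 mask is kept unchanged.
    """
    m_r = np.asarray(mask_referdino, dtype=bool)
    m_s = np.asarray(mask_sam2, dtype=bool)
    if m_r.shape != m_s.shape:
        raise ValueError("masks must have the same shape")
    area_r = int(m_r.sum())
    area_s = int(m_s.sum())
    if area_s < area_ratio_threshold * area_r:
        return m_r | m_s  # multi-object case: recover regions SAM2 missed
    return m_s  # single-object case: trust the SAM2 mask as-is
```

The intent of such a rule would be to keep the SAM2 mask in the common single-object case while not losing objects when the expression refers to several targets.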
Problem

Research questions and friction points this paper is trying to address.

Segment video objects using text descriptions accurately
Improve mask quality and object consistency in RVOS
Balance performance in single and multi-object scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates SAM2 for better mask quality
Uses conditional mask fusion strategy
Adapts object-level vision-language knowledge