The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses referring video object segmentation under motion-centric linguistic expressions and proposes the first fully training-free, three-stage approach. The method first leverages Gemini-3.1 Pro to parse natural-language instructions and generate discriminative target descriptions. It then employs the SAM3-agent to produce initial masks on keyframes, which the official SAM3 tracker propagates through the video. Finally, Qwen3.5-Plus performs semantic-consistency refinement to improve mask accuracy. This study presents the first integration of powerful multimodal large language models with the SAM3 framework, achieving state-of-the-art performance on the PVUW 2026 MeViS-Text test set with a Final score of 0.9091 and a J&F score of 0.7897.
📝 Abstract
This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.
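The three stages described in the abstract can be pictured as a simple orchestration skeleton. The sketch below is illustrative only: every function is a hypothetical stub standing in for a model call (Gemini-3.1 Pro for parsing, the SAM3-agent for seed segmentation, the official SAM3 tracker for propagation, and Qwen3.5-Plus for refinement); none of these names or signatures come from the released code.

```python
# Hypothetical sketch of the training-free three-stage pipeline.
# All stage functions are stubs: in the real system they would wrap
# Gemini-3.1 Pro, the SAM3-agent, the SAM3 tracker, and Qwen3.5-Plus.
from dataclasses import dataclass


@dataclass
class Grounding:
    frame_idx: int    # frame where the target is most clearly visible
    description: str  # discriminative instance-level description


def parse_expression(expression: str, num_frames: int) -> Grounding:
    """Stage 1 (stub): decompose the motion expression into a grounding target."""
    return Grounding(frame_idx=num_frames // 2, description=expression)


def segment_seed_frame(grounding: Grounding) -> set:
    """Stage 2a (stub): produce a seed mask (here a set of pixel coordinates)."""
    return {(10, 10), (10, 11), (11, 10)}


def propagate(seed: set, seed_idx: int, num_frames: int) -> dict:
    """Stage 2b (stub): propagate the seed mask to every frame of the video."""
    return {t: set(seed) for t in range(num_frames)}


def refine(masks: dict, grounding: Grounding) -> dict:
    """Stage 3 (stub): semantic-consistency check; keep non-empty masks only."""
    return {t: m for t, m in masks.items() if m}


def run_pipeline(expression: str, num_frames: int) -> dict:
    """Chain the three stages: parse -> seed mask -> propagate -> refine."""
    grounding = parse_expression(expression, num_frames)
    seed = segment_seed_frame(grounding)
    masks = propagate(seed, grounding.frame_idx, num_frames)
    return refine(masks, grounding)
```

The design point the stubs preserve is that each stage only passes data forward (a description, a seed mask, per-frame masks), so no stage requires gradient updates or task-specific fine-tuning.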
Problem

Research questions and friction points this paper is trying to address.

referring video object segmentation
motion-centric language expressions
temporal behavior
object interactions
video grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free pipeline
multimodal large language models
referring video object segmentation
SAM3
motion-centric language expressions
Xusheng He
Harbin Institute of Technology, Shenzhen, China
Canyang Wu
Harbin Institute of Technology, Shenzhen, China
Jinrong Zhang
Harbin Institute of Technology, Shenzhen, China
Weili Guan
Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China
Jianlong Wu
Professor, Harbin Institute of Technology (Shenzhen)
Computer Vision, Multimodal Learning
Liqiang Nie
Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China