Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge

📅 2026-03-31
🤖 AI Summary
This work addresses the limited capability of existing methods in understanding small-scale and semantically dominant objects in complex video object segmentation. To overcome this challenge, the authors propose Tracking-Enhanced Prompting (TEP), a training-free approach that integrates external object tracking signals with semantic prompts generated by multimodal large language models. These combined prompts are used to enhance the perception of challenging targets within the SAM3 framework. Evaluated on the PVUW 2026 Complex Video Object Segmentation benchmark test set, the proposed method achieves a score of 56.91%, ranking first and demonstrating significant improvement in segmenting difficult cases.
📝 Abstract
In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3's insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.
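The abstract describes TEP as a training-free pipeline: boxes from an external tracker are paired with semantic phrases from a Multimodal Large Language Model, and the combined prompts are fed to SAM3. A minimal sketch of that prompt-assembly flow is below; the paper publishes no code, so the `TrackingPrompt` structure, `build_tep_prompts`, and the stubbed `segment_with_prompts` call are hypothetical illustrations of the described idea, not SAM3's real API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical interfaces illustrating the tracking-enhanced prompt flow.
# The external tracker and MLLM are represented only by their outputs
# (boxes and phrases); the segmenter call is a stand-in for SAM3.

@dataclass
class TrackingPrompt:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) from an external tracker
    phrase: str                     # semantic description from an MLLM

def build_tep_prompts(tracker_boxes: List[Tuple[int, int, int, int]],
                      mllm_phrases: List[str]) -> List[TrackingPrompt]:
    """Pair per-target tracker boxes with MLLM phrases into combined prompts."""
    if len(tracker_boxes) != len(mllm_phrases):
        raise ValueError("one phrase is expected per tracked box")
    return [TrackingPrompt(box=b, phrase=p)
            for b, p in zip(tracker_boxes, mllm_phrases)]

def segment_with_prompts(prompts: List[TrackingPrompt]) -> List[dict]:
    # Placeholder for a SAM3-style call consuming box + text prompts;
    # a real system would return per-frame masks here.
    return [{"box": p.box, "phrase": p.phrase, "mask": None} for p in prompts]
```

Because the approach is training-free, all of the adaptation happens at this prompt level: the tracker supplies spatial grounding for tiny objects, while the MLLM phrase supplies the semantics that SAM3 alone struggles to infer.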
Problem

Research questions and friction points this paper is trying to address.

Complex Video Object Segmentation
tiny objects
semantic-dominated objects
target comprehension
cluttered environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tracking-Enhanced Prompt
Video Object Segmentation
Multimodal Large Language Model
Training-Free Method
Complex Scene Understanding
Jinrong Zhang
Harbin Institute of Technology, Shenzhen, China
Canyang Wu
Harbin Institute of Technology, Shenzhen, China
Xusheng He
Harbin Institute of Technology, Shenzhen, China
Weili Guan
Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China
Jianlong Wu
Professor, Harbin Institute of Technology (Shenzhen)
Computer Vision; Multimodal Learning
Liqiang Nie
Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China