PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This report summarizes the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. The challenge has two tracks: MOSE, on video object segmentation in complex scenes, and MeViS, on motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets that better reflect real-world scenarios. The report reviews the challenge outcomes and the participating methods, and discusses current trends and future research directions in complex video segmentation.

📝 Abstract
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state of the art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.
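The abstract refers to evaluation without describing how segmentation quality is scored. As a hedged illustration only, the sketch below computes region similarity J (mask intersection-over-union), a measure commonly reported for video object segmentation. The function names, the nested-list mask format, and the empty-mask convention are assumptions made for this example, not details taken from the report.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_region_similarity(pred_frames, gt_frames):
    """Average J over the frames of one clip (hypothetical helper)."""
    scores = [region_similarity(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(region_similarity(pred, gt))
```

A per-clip score like this would typically be averaged over all annotated objects and videos; how the challenge itself aggregates scores is not stated in this text.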
Problem

Research questions and friction points this paper is trying to address.

Advancing pixel-level understanding of complex videos in the wild
Evaluating video object segmentation in challenging real-world scenarios
Exploring motion-guided language-based video segmentation techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

A MOSE track on video object segmentation in complex scenes
A MeViS track on motion-guided, language-based video segmentation
New, more challenging datasets for both tracks that better reflect real-world scenarios
Authors

Henghui Ding, Fudan University (Computer Vision, Machine Learning, Segmentation, AIGC)
Chang Liu
Nikhila Ravi, Meta AI Research
Shuting He, Assistant Professor, Shanghai University of Finance and Economics (Computer Vision)
Yunchao Wei, Professor, Beijing Jiaotong University, UTS, UIUC, NUS (Computer Vision, Machine Learning)
Song Bai
Philip Torr, Professor, University of Oxford (Department of Engineering)
Kehuan Song
Xinglin Xie
Kexin Zhang, Tsinghua University (Data Mining, Machine Learning)
Licheng Jiao, Distinguished Professor of Xidian University, IEEE Fellow (Neural Networks, Computational Intelligence, Evolutionary Computation, Remote Sensing, Pattern Recognition)
Lingling Li, Associate Director of Biostatistics, Sanofi Genzyme (Causal inference, missing data, propensity score, sequential analytic methods, drug and vaccine safety)
Shuyuan Yang, Xidian University (Professor)
Xuqiang Cao
Linnan Zhao
Jiaxuan Zhao, Xidian University
Fang Liu
Mengjiao Wang
Junpei Zhang
Xu Liu
Yuting Yang
Mengru Ma, Xidian University (Fusion Classification, Remote Sensing Intelligent Interpretation)
Hao Fang
Runmin Cong
Xiankai Lu
Zhiyang Che
Wei Zhan, Co-Director of Berkeley DeepDrive, UC Berkeley; Chief Scientist of Applied Intuition (AI for autonomous systems)
Tianming Liang, Sun Yat-sen University (Video-language understanding)
Haichao Jiang
Wei-Shi Zheng, Professor, Sun Yat-sen University (Computer Vision, Pattern Recognition, Machine Learning)
Jian-Fang Hu, Sun Yat-sen University (Computer Vision and Machine Learning)
Haobo Yuan, UC Merced (Computer Vision, Deep Learning)
Xiangtai Li, Research Scientist, TikTok, SG; MMLab@NTU (Generative AI, Computer Vision)
Tao Zhang
Lu Qi, Insta360; Wuhan University (Computer Vision, Deep Learning)
Ming-Hsuan Yang, University of California at Merced; Google DeepMind (Computer Vision, Machine Learning, Artificial Intelligence)