🤖 AI Summary
This report summarizes the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025, which benchmarks pixel-level video understanding in complex natural scenes across two tracks: MOSE, targeting video object segmentation in complex scenes, and MeViS, targeting motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world conditions such as occlusion, small objects, and long videos. The report reviews the challenge outcomes and the methodologies of the participating teams, and distills the current state of the art and emerging trends, highlighting the growing importance of joint motion modeling and language grounding for open-domain video understanding.
📝 Abstract
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025, summarizing the challenge outcomes, the methodologies of participating teams, and future research directions. The challenge features two tracks: MOSE, which focuses on video object segmentation in complex scenes, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state of the art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.
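Both tracks are conventionally scored with the J&F measure: the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). Below is a minimal, illustrative sketch of a per-frame J&F computation on binary NumPy masks; the function names are ours, the boundary matching is simplified via morphological dilation, and the official evaluation toolkit uses a more careful procedure and averages over objects and frames.

```python
import numpy as np
from scipy import ndimage

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Contour accuracy F (simplified): F-measure between mask boundaries,
    counting boundary pixels that fall within `tol` pixels of the other mask's
    boundary. Hypothetical helper, not the official implementation."""
    pred, gt = pred.astype(bool), gt.astype(bool)

    def boundary(mask):
        # Boundary = mask minus its erosion (one-pixel-wide contour).
        return np.logical_xor(mask, ndimage.binary_erosion(mask))

    pb, gb = boundary(pred), boundary(gt)
    struct = ndimage.generate_binary_structure(2, 1)
    # Dilate each boundary so near-misses within `tol` pixels count as matches.
    gb_tol = ndimage.binary_dilation(gb, struct, iterations=tol)
    pb_tol = ndimage.binary_dilation(pb, struct, iterations=tol)
    precision = (pb & gb_tol).sum() / max(pb.sum(), 1)
    recall = (gb & pb_tol).sum() / max(gb.sum(), 1)
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom

def j_and_f(pred, gt):
    """Per-frame J&F; benchmark scores average this over objects and frames."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))

# Example: two partially overlapping 64x64 square masks.
gt = np.zeros((64, 64), bool); gt[16:48, 16:48] = True
pred = np.zeros((64, 64), bool); pred[20:52, 20:52] = True
print(f"J&F = {j_and_f(pred, gt):.3f}")
```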