🤖 AI Summary
This report summarizes the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025, which benchmarks pixel-level video understanding in complex natural scenes across two tracks: MOSE, targeting video object segmentation in complex scenes, and MeViS, targeting motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world conditions such as occlusion, small objects, and long videos. The report reviews the challenge outcomes and the methodologies of the participating teams, and distills the current state of the art and emerging trends, highlighting the growing importance of joint motion modeling and language grounding for open-domain video understanding.
📝 Abstract
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025, summarizing the challenge outcomes, the methodologies of participating teams, and future research directions. The challenge features two tracks: MOSE, which focuses on video object segmentation in complex scenes, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state of the art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.
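Both tracks are conventionally scored with the J&F measure: the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). Below is a minimal, illustrative sketch of a per-frame J&F computation on binary NumPy masks; the function names are ours, the boundary matching is simplified via morphological dilation, and the official evaluation toolkit uses a more careful procedure and averages over objects and frames.

```python
import numpy as np
from scipy import ndimage

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Contour accuracy F (simplified): F-measure between mask boundaries,
    counting boundary pixels that fall within `tol` pixels of the other mask's
    boundary. Hypothetical helper, not the official implementation."""
    pred, gt = pred.astype(bool), gt.astype(bool)

    def boundary(mask):
        # Boundary = mask minus its erosion (one-pixel-wide contour).
        return np.logical_xor(mask, ndimage.binary_erosion(mask))

    pb, gb = boundary(pred), boundary(gt)
    struct = ndimage.generate_binary_structure(2, 1)
    # Dilate each boundary so near-misses within `tol` pixels count as matches.
    gb_tol = ndimage.binary_dilation(gb, struct, iterations=tol)
    pb_tol = ndimage.binary_dilation(pb, struct, iterations=tol)
    precision = (pb & gb_tol).sum() / max(pb.sum(), 1)
    recall = (gb & pb_tol).sum() / max(gb.sum(), 1)
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom

def j_and_f(pred, gt):
    """Per-frame J&F; benchmark scores average this over objects and frames."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))

# Example: two partially overlapping 64x64 square masks.
gt = np.zeros((64, 64), bool); gt[16:48, 16:48] = True
pred = np.zeros((64, 64), bool); pred[20:52, 20:52] = True
print(f"J&F = {j_and_f(pred, gt):.3f}")
```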