🤖 AI Summary
This report covers the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, which evaluates pixel-level video understanding in highly unconstrained real-world scenarios, including dense clutter, severe occlusion, motion-focused linguistic references, and audio-driven target identification. The challenge comprises three tracks: MOSE for tracking objects in cluttered and occluded scenes, MeViS-Text for language-guided segmentation, and the newly introduced MeViS-Audio for audio-driven object segmentation, which extends the benchmark to the audio modality. The report introduces previously unreleased challenging data and analyzes the multimodal solutions submitted by participants, highlighting recent technical advances and promising directions for robust video scene understanding.
📝 Abstract
This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community's latest technical advancements and charts promising future directions for robust video scene comprehension.