🤖 AI Summary
Current video understanding research treats high-level semantic tasks (e.g., captioning, question answering) and dense pixel-level tasks (e.g., referring segmentation) in isolation, with separate benchmarks and architectures. To bridge this gap, we propose ViCaS, a large-scale benchmark that jointly evaluates holistic, video-level understanding and language-guided, pixel-precise segmentation. It comprises thousands of challenging videos annotated with detailed human-written captions and temporally consistent, pixel-accurate masks for multiple objects, with caption phrases grounded to the corresponding object masks. Alongside the dataset, we contribute carefully validated evaluation measures and an effective multimodal large language model (MLLM)-based architecture that handles both sides of the benchmark, demonstrating that high-level captioning and dense segmentation can be tackled within a single model.
📝 Abstract
Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas-project/
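The segmentation half of such a benchmark ultimately reduces to comparing predicted and ground-truth object masks over time. The sketch below is purely illustrative and assumes a simplified record layout (tracks as frame-indexed sets of foreground pixels, scored with plain per-frame IoU); it is not the actual ViCaS schema or its official evaluation measure.

```python
# Hedged sketch of scoring language-grounded video segmentation.
# Assumptions (not from the paper): each object track is a dict mapping
# frame index -> set of (row, col) foreground pixels, and the score is
# mean IoU averaged over frames, then over objects.

def mask_iou(pred, gt):
    """IoU between two binary masks given as pixel-coordinate sets."""
    if not pred and not gt:
        return 1.0  # both empty: treat as perfect agreement
    return len(pred & gt) / len(pred | gt)

def video_score(pred_tracks, gt_tracks):
    """Mean IoU over all ground-truth objects and annotated frames.

    Missing predictions count as empty masks, so a model is penalized
    for objects or frames it fails to segment.
    """
    per_object = []
    for obj_id, gt_frames in gt_tracks.items():
        pred_frames = pred_tracks.get(obj_id, {})
        ious = [mask_iou(pred_frames.get(t, set()), gt_mask)
                for t, gt_mask in gt_frames.items()]
        per_object.append(sum(ious) / len(ious))
    return sum(per_object) / len(per_object)

# Toy example: one phrase-grounded object over two frames.
gt = {"obj1": {0: {(0, 0), (0, 1)}, 1: {(1, 1)}}}
pred = {"obj1": {0: {(0, 0), (0, 1)}, 1: {(1, 1), (1, 2)}}}
print(video_score(pred, gt))  # (1.0 + 0.5) / 2 = 0.75
```

Real video-segmentation benchmarks typically use richer measures (e.g., tracking-aware or boundary-sensitive scores), but the frame-wise IoU averaging above captures the basic temporal-consistency requirement: a mask must stay accurate on every annotated frame, not just one.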