ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Current video understanding research treats high-level semantic tasks (e.g., captioning, question answering) and dense pixel-level tasks (e.g., referring segmentation) in isolation, without unified benchmarks or joint modeling frameworks. To address this, the authors propose ViCaS, the first large-scale benchmark enabling joint evaluation of video-level semantic understanding and language-guided pixel-level segmentation. It comprises thousands of videos annotated with temporally consistent masks, fine-grained natural language descriptions, and phrase-to-region alignments. The paper introduces a dual-objective evaluation paradigm with cross-modal alignment metrics, and designs a multimodal large language model (MLLM)-based architecture that aligns language, visual content, and spatiotemporal dynamics via phrase-grounding-driven dynamic mask generation and temporal consistency constraints. Experiments demonstrate synergistic improvements across video captioning, referring segmentation, and cross-modal retrieval, significantly outperforming single-task baselines.
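The paper's exact evaluation measures are defined in the paper itself; as an illustrative sketch only, language-guided video segmentation benchmarks commonly score predictions with a video-level mask IoU, where intersection and union are accumulated over all frames of a track before dividing (the function name and conventions below are assumptions, not the authors' implementation):

```python
import numpy as np

def video_mask_iou(pred_masks, gt_masks):
    """Video-level IoU for one object track.

    pred_masks, gt_masks: equal-length lists of boolean (H, W) arrays,
    one mask per frame. Accumulating over frames (rather than averaging
    per-frame IoUs) rewards temporally consistent predictions.
    """
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    # Empty prediction and empty ground truth count as a perfect match.
    return inter / union if union > 0 else 1.0

# Toy example: two 2x2 frames.
pred = [np.array([[1, 1], [0, 0]], bool), np.array([[0, 0], [1, 1]], bool)]
gt   = [np.array([[1, 0], [0, 0]], bool), np.array([[0, 0], [1, 1]], bool)]
print(video_mask_iou(pred, gt))  # 3 intersecting / 4 union pixels = 0.75
```

In a phrase-grounded setting, this score would be computed per grounded phrase and then averaged over phrases and videos; again, that aggregation is an assumption about typical practice, not a statement of ViCaS's protocol.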

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas-project/
Problem

Research questions and friction points this paper is trying to address.

Unifies high-level video understanding with pixel-level segmentation tasks
Introduces ViCaS dataset for holistic and precise video analysis
Proposes evaluation metrics and model for combined video comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines holistic and pixel-level video understanding
Uses captions with grounded segmentation annotations
Proposes unified model architecture for dual tasks